ABSTRACT


The use of multiple-choice test items involving 
one choice out of four or five alternatives has been so 
extensive that many test writers and users now employ no 
other form of test. There are, however, two major disadvan- 
tages inherent in the conventional one-or-zero scoring system. 
The first of these is that this method is unable to dis- 
criminate between partial information, complete information, 
and no information. The second undesirable feature is the 
encouragement of guessing. 

In the present study an attempt was made to improve 
upon current techniques of administering and scoring the 
multiple-choice test, or, more specifically, to increase 
the reliability and validity of such tests beyond what they are
under conventional administration and scoring.

Three test-taking methods were used in this study: 
Conventional Testing, Confidence Testing, and Elimination 
Testing. Four different scoring techniques were used along 
with these test-taking methods. Conventional Scoring was 
used with Conventional Testing and Elimination Testing; 
Differential Weighted Scoring, with Conventional Testing; 
Confidence Weighted Scoring, with Confidence Testing; and 
Elimination Scoring, with Elimination Testing. The Confi- 
dence Weighted Scoring was done by use of five different 
scoring functions, three of which were introduced in this
study.

There were two aptitude tests--vocabulary and 
mathematics--used with these experimental methods. Two 
types of criterion scores were obtained: school achieve- 
ment and aptitude test scores from similar forms of the 
vocabulary and mathematics aptitude tests. The aptitude 
tests as criteria were administered and scored using 
conventional procedures. 

Subjects were 1028 grade nine students randomly 
assigned to three groups of comparable size. Each group 
was given the tests under one of three test-taking methods. 
For each group, there were two test sessions. In the first 
session two tests were given--vocabulary and mathematics. 
In the second, the same two tests were given again, and two 
other aptitude tests were administered, using the conventional 
treatment, to obtain test scores for use as criteria. School 
achievement scores were obtained from school records. 

It was found that two scoring functions employed 
with the Confidence test-taking method provided scores more 
reliable than did the conventional method with either Conven- 
tional or Differential Weighted Scoring. These two functions 
are based on the increasing increment scoring model. Both 
the functions and the scoring model were introduced in this 
study. It was found, however, that none of the experimental 
scoring techniques provided test scores more valid than those
provided by the conventional test-taking and scoring methods.

Since validity is generally the most important
test characteristic, the findings indicate that the experi- 
mental test-taking and scoring approaches may not be worth 
the effort. A discussion of the findings, in relation to the
results of some previous studies and to theoretical
implications, is given. It is suggested that, in order to overcome
the shortcomings inherent in the use of the conventional
test-taking and scoring procedures, investigators might be
wise to pursue leads other than those examined in the
present study.


CHAPTER I
BACKGROUND OF THE STUDY 


Introduction to the Problem 

The use of multiple-choice test items involving one 
choice out of four or five proffered alternatives has been 
so extensive that many test writers now use no other form 
(Hendrickson, 1971). But the general acceptance of the 
multiple-choice form of test item as the best one for 
objective measurement of aptitude or achievement does not 
imply that it has reached its optimal form. Any variation 
upon an already widely accepted technique which indicates 
promise of improved measurement is deserving of further 
investigation (Coombs, Milholland, & Womer, 1956). 

During the last two decades or more there has been 
an increasing interest in looking for new methods of scoring 
this form of test in place of the conventional one-or-zero 
method. There have been a number of suggestions for new 
complex scoring patterns for objective tests, particularly 
multiple-choice ones (Thorndike & Hagen, 1969, p. 123). 

All these ideas stemmed from the fact that test writers and 
users, at present, are still unsatisfied with the conventional 
scoring method commonly employed. Among the several
disadvantages of the conventional procedure, the most
crucial appear to be: (1) the inability to discriminate between
partial information and complete information or misinformation,
and (2) the encouragement of guessing (Coombs et al., 1956).
Accordingly, investigators have spent much time and effort 
finding out whether there are other reliable techniques to 
replace the old one, hoping that the new techniques might 
provide the possibility of eliminating these disadvantages 
and result in increased test reliability and validity. 
Scoring formulas that assess partial knowledge of examinees 
have been proposed and used in many studies. The conclusion 
of these studies is that partial knowledge does exist and 
that by employing proper scoring techniques the reliability 
of multiple-choice tests may be increased (Coombs et al., 
1956; Sabers & White, 1969). 

Among several alternatives in studies to assess 
partial knowledge, two kinds of techniques are typically 
employed: those that (1) differentially weight the response 
alternatives, and (2) require examinees to report their con- 
fidence in the correctness of response alternatives (Hambleton, 
Roberts, & Traub, 1970). For the first method, some studies 
carried out in comparison with the conventional method have 
reported that there is a certain amount of increment in test 
reliability and also test validity (Davis & Fifer, 1959; 
Dressel & Schmid, 1953; Hambleton et al., 1970; Rippey, 
1970; Sabers & White, 1969). However, because the 
increments they found were sometimes small and sometimes 
inconsistent, there are still no sound conclusions that can
be drawn from those works with respect to the applicability
of the new technique. Several investigators remain
optimistic about the problem and have suggested that further
studies be undertaken (Hambleton et al., 1970; Sabers & White,
1969; Stanley & Wang, 1970). 

More studies involving the use of confidence
weighting have been reported than studies using the
differential weighting technique. Two factors seem to
account for this: the promising results of
the technique found in earlier works and the variety of
formulas suggested for assigning the confidence level. Among
those studies reported, most show an increase of test reli- 
ability and some an increase in test validity, although many 
of them could not find any consistency in the increment 
(Dressel & Schmid, 1953; Ebel, 1965; Hopkins, Hakstian, & 
Hopkins, 1973; Michael, 1968). At present there are still no 
definitive conclusions about the contribution of the new 
techniques to testing. However, although these investigators 
could not formulate definitive conclusions from their works, 
they did recommend ways of improving future studies 
(Hambleton et al., 1970). 

Another technique suggested in the literature which 
is supposed to assess partial knowledge is the method of 
elimination (Coombs et al., 1956). Although this technique
is considered as one of the confidence weighting methods 
(Wang & Stanley, 1970), the procedure seems rather 
distinctive. In this method subjects are instructed to
eliminate incorrect alternatives, taking care not to mark
the correct one. The item score depends on (1) option weights,
+1 for each incorrect alternative crossed out and -3 for the
correct alternative crossed out in a four-choice item, and
(2) the examinee's confidence level as shown by the number
of alternatives he crossed out. One study, by
Coombs et al. (1956), used this technique, and its results
indicated a potentially promising contribution to the testing
field.

Because of the dissatisfaction with the disadvantages 
of the conventional scoring and the promising results from 
the new techniques reported in various studies, investigators 
have tried to approach the problem with different and improved 
designs to obtain more precise results. It was generally 
hoped that if these studies proved the superiority of the new 
techniques over the old one there would be a radical change 
in testing practices. None of the many studies reported in 
the literature, however, has demonstrated definitely the 
advantages of a new technique over the old one. The problem 
still exists and awaits solution. 

It should be noted that there is at least one study 
that comes very close to providing some sort of definitive 
conclusions. This work was done by Hambleton et al. (1970). 
They used both differential weighting and confidence weighting 
techniques in comparison with the conventional technique in 
the same study. The main purpose was to see whether
significant increments in both test reliability and validity could
be found. Because the number of subjects
was small and the test was rather easy for the subjects'
level, they failed to establish the expected conclusion. 
However, the failure did not discourage them from showing 
enthusiastic hope and optimism about the new techniques. 

The most recent work reported in the literature on 
this problem was done by Hopkins et al. (1973). This study 
concentrated on confidence weighting. The investigators
tried to improve on the procedures previously employed.
They were successful in showing an increment in test reli- 
ability with confidence weighting but failed to find any 
consistent evidence for increased test validity. This work 
should not be regarded as providing closure to the problem, 
because of the different design and method of scoring they 
used in their study.

It can be concluded that all studies reported thus 
far, although worth the effort and contributing to the solution 
of the problem, are still not sufficient to support final
definitive conclusions. Further, more comprehensive studies are
required before sound conclusions can be made. 

With the ideas discussed above and an optimistic 
view about the new techniques, the author decided to address
the problem with a more comprehensive experimental design.


Statement of the Problem and Implications for Education

As many studies have reported, the basic objective
of the investigation is to discover a new method of scoring
multiple-choice test items that can eliminate two crucial
disadvantages resulting from the use of the conventional
method: (1) the inability to assess partial knowledge, and 
(2) the encouragement of guessing. Theoretically, it is 
widely agreed that these new techniques can eliminate both 
disadvantages. But even though they can, it does not imply 
that these techniques are ready to be used. Improvements 
in test reliability and validity must be demonstrated, and 
empirical evidence is required in this regard. 

In this study, effort was made to assess whether 
improvement in reliability and validity resulted from the 
use of different test-taking and scoring methods. More 
procedures were assessed, and a larger sample was used than 
in earlier work. It was hoped that, by utilizing a more 
comprehensive design, more definitive results would be 
obtained. 

The above discussion leads to the main implication 
for educational practice, that if this study showed signifi- 
cant improvements in test reliability and validity for any 
of the techniques compared, the results would be significant 
to the multitudes of test users. With the already accepted 
ability of these techniques to assess partial information 
and eliminate guessing, and with demonstrated improved 
reliability and validity of test scores, the methods should
be found attractive by test users.


Limitation of the Study 


This study was confined to grade nine junior high
school students. Therefore, the results are representative 
of subjects at this school level only, and should be cautiously 
generalized to higher or lower levels of education, i.e., 
elementary school, senior high school, or college. 

The tests used in this study were aptitude tests
(verbal comprehension and arithmetic reasoning), and the criteria
were school achievement scores at the end of the first term.
Inferences should, strictly speaking, be confined to these
predictors and criteria. 

No attempt was made to compare experimental and con- 
ventional methods with respect to administration time, levels 
of test difficulty, levels of intelligence, sexes, or any
other variables not previously discussed.


Research Hypotheses 

The ultimate goal of this study was to examine 
differences in test reliability and validity using the new 
testing and scoring methods, using as a baseline results 
obtained by administering and scoring tests under conventional 
procedures. It was expected that test scores obtained under 
either experimental method would contain more information 
about the state of each examinee's knowledge than the 
conventional one since they allow partial information to 
be taken into account. 

On the basis of this rationale, this study was carried
out to examine the following hypotheses: 

(1) Test scores obtained under any of the experimental
methods are more reliable than those obtained under the
conventional method. Evidence examined would include both
the internal consistency of test scores and the stability 
of test scores over a period of time. 

(2) Test scores obtained under any of the experimental 
methods have higher validity for outside criteria measuring 
either (a) the same or similar traits or (b) school
achievement, than those obtained under the conventional method.


Theoretical Framework
(1) Definition of Terms 
(a) Testing Methods. 

Conventional Testing (CV Method). This is the 
test-taking procedure most widely used at present. The 
procedure is to encourage examinees to read a question and 
then choose the one of four or five response alternatives 
that they judge to be the correct answer. 

In this study subjects were encouraged to answer 
all items. However, no explicit instructions to guess were 
given. 

Confidence Testing (CF Method). The technique was 
a modified version of that used by Michael (1968) and
Hambleton et al. (1970). In this technique, each subject 
was instructed to distribute 10 points of confidence among 
five response alternatives for an item according to the 
correctness he thought each one had. 

Elimination Testing (EL Method). This test-taking
method was developed and used by Coombs et al. (1956). In
this method a subject is asked to select and mark incorrect
answers out of five given response alternatives for an item.
He can mark from one to four of the given answers, leaving one
that he thinks is correct. The item score depends on the 
number of choices he marks and whether he marks the correct 
answer or not. 

(b) Scoring Methods. 

Conventional Scoring (CVS Method). This scoring 
method was applied to answers under both Conventional 
Testing and Elimination Testing. Under the Conventional 
Testing, a correct answer would score one, and an incorrect
answer or an omission, zero. Under the Elimination
Testing format, a score of one was given to an answer with
all four incorrect alternatives crossed out; otherwise,
including for omissions, zero was given. Thus an item score
would be either one or zero. The total test score was the
sum of the item scores. No correction for guessing was
applied.
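
The one-or-zero rule described above can be sketched in a few lines. This is a minimal illustration only, not the scoring program used in this study; the function names, the option labels A to E, and the use of None for an omission are assumptions made for the example.

```python
def score_conventional(chosen, key):
    """One-or-zero item score under Conventional Testing.

    chosen: the option the examinee picked, or None for an omission.
    key: the keyed (correct) option.
    """
    return 1 if chosen == key else 0


def score_elimination_as_conventional(crossed_out, key, options="ABCDE"):
    """One-or-zero item score under the Elimination Testing format.

    Credit is given only when every incorrect option is crossed out
    and the keyed option is left unmarked.
    """
    incorrect = set(options) - {key}
    return 1 if set(crossed_out) == incorrect else 0


def total_score(item_scores):
    """Total test score: the plain sum of item scores, no guessing correction."""
    return sum(item_scores)
```

Under the Elimination Testing format this rule ignores partial elimination: crossing out three of the four incorrect alternatives earns the same zero as an omission.
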

Differential Weighted Scoring (DWS Method). This
scoring method was used by Hambleton et al. (1970). In this
study it was applied to answers under the Conventional Testing
format. The technique differed from Conventional
Scoring in that, instead of scoring one or zero, Differential
Weighted Scoring assigned predetermined weights to all
incorrect response alternatives according to the degree of
correctness each one had, with a weight of five to the correct
answer. The item score was the weight of the option chosen by
a subject, and the total test score was the sum of the item
scores.

In this study, the weight of each response alternative 
was assigned on the basis of a pilot analysis using a group 
of students in the Department of Educational Psychology at 
the University of Alberta. Details of this pilot study are 
given in Chapter III. 
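
As a rough sketch of how such option weights translate into scores, consider the following. The weights shown are hypothetical and are not the pilot-study weights. Scoring an omission as zero is also an assumption, since the text does not state how omissions were handled under this method.

```python
# Hypothetical weights for two five-option items; the keyed option gets 5.
ITEM_WEIGHTS = [
    {"A": 1, "B": 3, "C": 5, "D": 0, "E": 2},  # item 1, keyed C
    {"A": 5, "B": 0, "C": 2, "D": 4, "E": 1},  # item 2, keyed A
]


def score_differential(chosen, weights):
    """Item score: the predetermined weight of the option chosen.

    An omission (None) is scored zero here; this is an assumption.
    """
    if chosen is None:
        return 0
    return weights[chosen]


def total_differential(responses, item_weights):
    """Total test score: the sum of the item scores."""
    return sum(score_differential(r, w) for r, w in zip(responses, item_weights))
```

An examinee who chooses B on the first item and A on the second would receive 3 + 5 = 8 under these made-up weights, where conventional scoring would give 0 + 1 = 1.
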

Confidence Weighted Scoring (CWS Method). This 
scoring method was used with true-false test items by Ebel 
(1965) and then discussed in detail by Shuford, Albert, &
Massengill (1966). It was then used by other investigators
with some modifications (Hambleton et al., 1970; Hopkins et al.,
1973; Michael, 1968). In this scoring method, either the
raw number of points a subject gives to the correct answer is taken
as the item score or a function (usually non-linear) of this
value is used. In this study the technique was applied to 
answers under the Confidence Testing format. Five scoring 
functions, including two suggested and used in previous 
studies, were used to get different item scores. The total 
test score was the sum of item scores obtained by each 
scoring function. Details of the scoring functions are 
given in Chapter III. 
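
The general form of this scoring rule can be sketched as follows. The sketch assumes that the 10 points are distributed as non-negative whole numbers. The non-linear function in the example is invented purely for illustration; it is not one of the five scoring functions used in this study, which are described in Chapter III.

```python
def score_confidence(allocation, key, transform=None, total_points=10):
    """Item score under Confidence Weighted Scoring.

    allocation: dict mapping each option label to the points placed on it.
    key: the keyed (correct) option.
    transform: optional function of the points on the keyed option;
        None means the raw points are the item score.
    """
    points = list(allocation.values())
    if any(p < 0 for p in points) or sum(points) != total_points:
        raise ValueError("points must be non-negative and sum to total_points")
    on_key = allocation.get(key, 0)
    return on_key if transform is None else transform(on_key)


# Hypothetical response: 6 points on B, 4 on C, none elsewhere; B is keyed.
response = {"A": 0, "B": 6, "C": 4, "D": 0, "E": 0}
raw_score = score_confidence(response, "B")
# A made-up non-linear function, for illustration only.
squared_score = score_confidence(response, "B", transform=lambda p: p * p)
```

A subject with no information who spreads the points evenly (2 on each option) would receive a raw score of 2 on every item, whichever option is keyed.
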

Elimination Scoring (ELS Method). This scoring 
method was used by Coombs et al. (1956) and was applied to 
answers under the Elimination Testing format in this study. 
The technique was slightly modified for use with a
five-choice item. Each incorrect answer selected scored one;
the correct answer, if selected, scored -4; omissions
scored 0. An item score was the algebraic sum of the scores for all response
alternatives marked, ranging from -4 to +4. The total test
score was the algebraic sum of all the item scores.
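
The Elimination Scoring rule just described can be written compactly. The sketch below is illustrative only; the function name and the option labels are assumptions. The penalty for marking the correct answer is expressed as the number of options minus one, which gives -4 for the five-choice items of this study and -3 for the four-choice items of Coombs et al. (1956).

```python
def score_elimination(crossed_out, key, n_options=5):
    """Item score under Elimination Scoring.

    crossed_out: the options the examinee marked as incorrect.
    key: the keyed (correct) option.
    Each incorrect option marked scores +1; the keyed option, if marked,
    scores -(n_options - 1); an item with nothing marked scores 0.
    """
    penalty = -(n_options - 1)
    score = 0
    for option in set(crossed_out):
        score += penalty if option == key else 1
    return score


def total_elimination(responses, keys, n_options=5):
    """Total test score: the algebraic sum of the item scores."""
    return sum(score_elimination(r, k, n_options) for r, k in zip(responses, keys))
```

With B keyed, crossing out A, C, D, and E scores +4; crossing out only A and C (partial knowledge) scores +2; and crossing out A, B, and C scores -4 + 2 = -2.
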

(c) School Achievement. This term refers to grades
or scores obtained from schools at the end of the first term
of the 1972-73 school year. These scores were used as criteria
in this study. Grades were obtained for the following 
subjects: 

(a) Language Arts. 

(b) Mathematics. 

(c) Science. 

The academic average was obtained from scores of 
all subjects listed above plus Social Science. 

Basic Assumptions. For the purpose of objective 
interpretations that would be drawn from this study, it 
was assumed that: 

(1) All test and criterion scores used in this study 
were reliable with respect to Conventional Testing and 
Scoring.

(2) Subjects used in this study were only randomly 
different from one another across all test format groups. 

(3) Subjects clearly understood test instructions given 
at the testing times. 

(4) Subjects were cooperative and eager to obtain high 
test scores.


(5) There were no differences between the experimental
and conventional test format groups with respect to the
following factors: administration time, subjects' and
proctors' personalities, intelligence, and sex.


CHAPTER II
RELATED LITERATURE 


When measures are to be combined to form a composite 
measure or to predict a criterion, the question of
differential item or subtest weighting arises. There is
a theoretical rationale to support this idea. The need for
differential item-option weighting generally arises from the
desire to improve the reliability and validity of a composite
score. The term differential weighting has been used for
several techniques. Some investigators have weighted tests, 
some test items, and others item responses. Some even went 
beyond this by developing the method of response-determined 
scoring which can also be described as a form of differential 
weighting (Wang & Stanley, 1970). 

Although differential weighting of items or item
options theoretically promises to provide substantial gains in
test reliability and validity, in practice some approaches
often yield gains so slight that they do not seem to
justify the labour involved in deriving the weights and then
employing them in scoring. A number of psychologists have
concluded that some of these weighting approaches, especially
the item-weighting type, are not worth the trouble (Guilford,
1954; Gulliksen, 1950). However, differential
item-option weighting on aptitude or achievement tests has shown
potential, and it has been proposed by test specialists
that reliability and validity of a test may be increased 
if a subject himself assigns weights to options according 
to his confidence in the correctness of each option (Wang 
& Stanley, 1970). 

There are several distinct techniques of option 
weighting reported in the literature and some of them, with 
modification, have been used in this study. Each will be
reviewed separately and then treated as an integrated whole.


The Differential Weighting of Item Responses

Differential weighting techniques differ from the
conventional technique in that, instead of scoring one for 
a correct answer and zero for an incorrect or omitted response, 
weights (usually a priori) are assigned to each response 
alternative to an item. A test score then consists of the 
sum of the weights of the response alternatives that the 
subject chose in responding to the test (Hambleton et al., 
1970). A wide variety of techniques has been proposed and 
used to assign weights to response alternatives. Until 
fairly recently, the possibility of the differential 
weighting of item responses was not considered in the
literature (Wang & Stanley, 1970). However, after the 
demise of formula scoring which tried to eliminate the effect 
of guessing on test scores, some investigators have turned 
their attention to this new approach. The first step in the
direction of differential weighting of incorrect responses
was made by Nedelsky (1954). He used the opinions of
experts to identify distractors with respect to the achieve- 
ment levels of subjects. In spite of the complicated technique 
he used, the composite score was considerably more reliable 
than the conventional score. 

Davis & Fifer (1959) took a significant second 
step after Nedelsky. These authors noted that the conven- 
tional one-or-zero scoring did not permit differentiation 
among examinees with respect to the type of distractors 
selected. Students should also be differentiated by the 
option they select. This idea really concerns the assessment 
of partial knowledge. Two subsequent steps were made. The 
first emphasized test reliability and the second, test validity. 
The tests used in their study were two forms of arithmetic
reasoning consisting of 50 items each. Option weights were
obtained empirically by three successive steps. First, two
mathematicians rated all options with respect to their degrees
of correctness on a seven-point scale. The weights were then
used to score subjects' answers in a pilot study. The second
step was to find a new set of weights using the correlation
between marking the option and the total score obtained by
the first set. Option weights from this step were then
modified and adjusted to form the final set for the main 
study. This complex procedure produced a satisfactory 
change in reliability but no increase in validity. The 
test-retest reliability of test scores under the experimental
scoring method was .763 as compared with .684 obtained
under the conventional method, and the difference between the
two values attained statistical significance. But there was
no significant increase in the validity coefficient. The
authors' conclusion is that the variance introduced into the 
total score had increased the proportion of true variance, 
thus increasing test reliability, but that the new variance 
displayed the same concurrent validity as the original, thus 
resulting in the unchanged test validity. 

Jacob & Vandeventer (1968) undertook a study using
the notion of facet analysis (Guttman & Schlesinger, 1967)
to obtain option weights on the Coloured Progressive Matrices
Test (CPM). This procedure made use of the mean number of
respondents choosing the distractor as an option weight.
This a priori method of keying the response alternatives was
shown to have a moderate degree of test-retest reliability,
and concurrent and predictive validity (Hambleton et al., 1970;
Wang & Stanley, 1970). 

Sabers & White (1969) used differential weighting 
with 370 grade nine students divided into four groups. The 
experimental test was the Iowa Algebra Aptitude Test (IAAT) 
which was given to students while in the eighth grade. The 
upper and lower 27 per cent of a group were chosen on the 
basis of scores on an achievement test. The percentages
within these groups marking an item option on the IAAT were
used to obtain the weight for that option from the table
prepared by Davis (1966). Weights obtained by this procedure
from one group were used to score tests for another group in
order to assess the cross-validity of the weighted scoring.
The criterion measures were 40-item multiple-choice
achievement tests administered after the students had 
completed one semester in ninth-grade mathematics. The 
achievement tests were scored as the number correct. Sabers 
& White found that differential weighting resulted in small 
increments in both the reliability and predictive validity. 
They postulated that two factors contributed to the failure 
to achieve larger increments. These were: (1) the groups 
were not well matched on their aptitudes, and (2) the aptitude 
test had a relatively high degree of reliability. 

Hambleton et al. (1970) compared two a priori methods 
of option weighting, one based on the procedure of Jacob &
Vandeventer (1968) and one based on the average rank assigned 
to options by judges who ranked all options for correctness. 
The experimental method was used to score a five-option mid- 
term test and the criterion score was a final examination 
consisting of a 60-item multiple-choice test administered 
and scored under the conventional method. This study also 
included the confidence method which will be discussed later. 
The results showed that both sets of weights tended to 
increase estimated predictive validity of the midterm 
examination, and the reliability was slightly increased 
with the second set of weights. None of these increments
in validity attained statistical significance, and no
test of the differences between reliabilities was made.
However, the authors noted that, in this study, the number
of subjects in each group was rather small and that the test
used was easy for the group level being tested.

The most recent study on differential weighting 
was done by Hendrickson (1971). She used option weights 
secured by using Guttman's technique for maximizing internal 
consistency. The weights were derived via an iterative 
procedure which began by assigning to a response alternative 
a weight equal to the mean total score on the remaining items 
of the sub-test obtained by the examinees who marked that 
response alternative. The technique was applied to a large 
number of subjects who took the Scholastic Aptitude Test 
(SAT), Verbal and Mathematics forms, totalling 100,000 
persons. Hendrickson found that weighting options resulted
in increased reliability equivalent to lengthening the test by
19.09 per cent to 78.25 per cent of the original number of items
under the conventional scoring, but there were no clear
effects on test validity, which was slightly decreased.
Hendrickson concluded that weighted options resulted in
increased homogeneity of the test, thus changing what it is
measuring, and thereby decreasing test validity.
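
Statements that a gain in reliability is equivalent to lengthening a test are conventionally based on the Spearman-Brown formula. The sketch below assumes that this formula underlies such comparisons (the text does not say so explicitly) and reads the percentages above as increases in test length. The baseline reliability of .80 is a hypothetical value chosen only to make the example concrete.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability of a test lengthened by length_factor.

    length_factor is the ratio of new length to original length,
    so a 20 per cent increase corresponds to 1.20.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def equivalent_length_factor(old_reliability, new_reliability):
    """Length factor that would raise old_reliability to new_reliability."""
    return (new_reliability * (1 - old_reliability)) / (
        old_reliability * (1 - new_reliability)
    )


if __name__ == "__main__":
    baseline = 0.80  # hypothetical reliability under conventional scoring
    for increase in (0.1909, 0.7825):
        predicted = spearman_brown(baseline, 1 + increase)
        print(f"{increase:.2%} longer -> reliability {predicted:.3f}")
```

The second function inverts the first: given the reliabilities under two scoring methods, it returns the lengthening of the conventionally scored test that would produce the same gain.
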


The Response-Determined Scoring 


All the weighting techniques discussed above have 
the characteristic of a constant multiplicative weight being 
directly associated with the response alternatives. Once the 
weights have been determined, the examinee's score on a test 
is completely determined by the response options he selects.
Response-determined scoring, represented by the elimination
and confidence weighted scoring methods, is an alternative
strategy for obtaining item scores. The distinctive
characteristic of this scoring approach is that the examinee's
response to an item consists of more than simply selecting
the correct option. Under this approach, the concept of a
"response" has been considerably broadened, and it is the
characteristic of the response, rather than those of the 
item or the option, which determines the item score (Wang 
& Stanley, 1970). 

There are a large number of studies reported in 
the literature using one form or another of this scoring 
approach. Some of them have used the elimination method 
and others, the confidence weighting method, as used in 
this study. 

Dressel & Schmid (1953) studied four different 
experimental methods, two of which can be categorized as 
two of the methods under consideration in the present study: 
the Free-Choice Test and the Degree-of-Certainty Test. The 
Free-Choice test required students to mark as many choices 
as needed in order to be sure that they had not omitted the 
correct answer. Each incorrect mark was scored -1/4 point 
and the correct mark was scored one. The Degree-of-Certainty 
test, considered as one form of the confidence weighting 
rather than the elimination method, required students to 
indicate the degree of certainty they had in the single 
answer they selected by assigning one of four possible
values. The item score for this latter method depended on
whether they selected the incorrect answer or not, and on
the degree of certainty they assigned, resulting in the 
range from -4 to 4. The test used in this study was a five- 
option test and subjects were college students. This com- 
bination of response method and system yielded reliabilities 
of .67 for the Free-Choice test and .73 for the Degree-of- 
Certainty test as compared with .70 obtained with the conven- 
tional scoring procedure, but no difference attained 
statistical significance. The conclusion from the results 
of these two experimental methods, as referred to by 
Echternacht (1972), was that

. . . Superior students, defined in term of traditional
test scores, differed significantly from average and 
poor students when using the free-choice format, the 
difference being that high performers marked fewer 
answers across each of three different levels of item 
difficulty. . . . The degree-of-certainty method, on
the other hand, differentiated superior, average, and 
low-ability students about equally well, the confidence 
marks being about the same for both average and diffi- 
cult items. It was also concluded that the certainty 
factor measured by the free-choice item was not the same 
as that measured by the degree-of-certainty item (p. 221). 
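
As an illustration only, the Free-Choice scoring rule described above (one point for marking the correct answer, minus one quarter point for each incorrect mark) can be sketched in Python. The function name and the use of option labels are assumptions made for this sketch; they do not come from Dressel & Schmid (1953).

```python
def free_choice_item_score(marked, correct, penalty=0.25):
    """Score one Free-Choice item.

    marked  -- collection of option labels the student marked (hypothetical labels)
    correct -- label of the keyed answer
    penalty -- points lost per incorrect mark (1/4 in the rule described above)
    """
    marked = set(marked)
    wrong_marks = len(marked - {correct})          # incorrect options marked
    credit = 1.0 if correct in marked else 0.0     # one point for the correct mark
    return credit - penalty * wrong_marks
```

Under this rule a student who marks the correct answer together with one distractor receives 0.75, so partial information is rewarded less than complete information but more than a wrong single choice.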

Coombs et al. (1956) performed an experiment
complementary to the Free-Choice test in which subjects
were instructed to eliminate incorrect alternatives in four-choice
test items, taking care not to mark the correct one.
This technique is termed the Elimination Method in the
present study. One point was gained for each incorrect
alternative crossed out and three points were lost if the correct
response was crossed out, resulting in item scores ranging from
-3 to 3. Three 40-item multiple-choice tests were used:
a vocabulary test, a test of driver information,
and a test of spatial visualization. Subjects were 855 junior
and senior high school students, grouped under three methods
of testing: the conventional method (C), the experimental
method (E), and the 'both' method (B). These three groups
were matched on aptitude as measured by several standardized
tests. The main assumption of the study was that
partial information exists and enters into answering multiple-choice
items. The authors found that the reliability of tests
under the experimental method showed an increase equivalent to
that produced by a 20 per cent increase in the length of a
conventional test of the same type. The authors also pointed
out that as the difficulty of the test increased, the reliability
also increased, and that the same item discriminated
well when administered in either multiple-choice or experimental
formats.
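
The elimination scoring rule of Coombs et al. (1956) described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name, the option labels, and the generalization of the penalty to "number of options minus one" are assumptions of the sketch.

```python
def coombs_elimination_score(crossed_out, correct, n_options=4):
    """Item score under the elimination rule for an n_options-choice item.

    +1 for each incorrect alternative crossed out;
    -(n_options - 1) if the correct alternative is crossed out
    (three points for the four-choice items described above).
    """
    crossed = set(crossed_out)
    score = len(crossed - {correct})       # credit for eliminated distractors
    if correct in crossed:
        score -= n_options - 1             # penalty for eliminating the key
    return score
```

With four options the score runs from -3 (only the correct answer crossed out) to 3 (all three distractors crossed out), and crossing out every option nets zero.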

Another method of assessing partial knowledge entails 
the use of confidence weighting or the personal probabilistic 
approach, as preferred by some authors (de Finetti, 1965; 
Rippey, 1968; Shuford et al., 1966). The general technique 
of this approach is to ask students to assign weights to all 
response alternatives indicating preference or degree of 
belief for each one. These assigned weights are then subjected
to some pre-determined scoring function to obtain
item scores. This procedure has its historical antecedents
in connection with the true-false format, in studies by
Hevner (1932) and Soderquist (1936). More recently, Ebel
(1965) found that confidence weighting could improve true-false
test reliability.

Dressel & Schmid (1953) studied the confidence 
weighting technique with multiple-choice test items. With 
the weighting technique they used, a reliability of .73 was 
found using this experimental method as opposed to .70 for 
the conventional one. Michael (1968) studied the use of a
10-point confidence distribution scheme with 432 senior high
school students in history classes using the STEP: Social
Studies test, comparing it with conventional scoring and formula
scoring (correction-for-guessing). She reported evidence of
increased reliability, from .764 to .840, when using
confidence testing. No test of the difference between the two
values was made. Her conclusion was that the confidence
weighting method affords considerable promise in effecting a
higher estimate of reliability and a lower standard error of
measurement than does either the conventional or the formula
method, and that the confidence weighting method was a
workable technique that could be employed by the average
classroom teacher.

Hopkins et al. (1973) used only three levels of
confidence distribution, H, M, and L, in their study with
63 graduate students taking a statistics course. The test
used was a 65-item multiple-choice test, with a short-answer
test measuring the same content as a criterion for validity.
The results of this study showed an increased reliability,
from .883 by the conventional method to .915 by the
experimental method, but decreased validity, from .701 to
.661. However, these differences did not attain statistical
significance. The findings led to the conclusion that

. . . the added reliable variance often observed in
confidence testing studies may be irrelevant response
style variance and does not increase validity, in fact,
it may actually diminish validity (Hopkins et al., 1973).

Studies discussed thus far have shown that differential
weighting of distractors, including the several forms
of confidence weighting in aptitude and achievement tests,
has been examined with great interest. Rarely have
studies used both the elimination and confidence weighting
approaches in a single investigation. However, there are
two pieces of work in which this has been done: one is by
Hambleton et al. (1970) and the other by Collet (1971). In
the study by Hambleton et al., the authors used both differential
weighting and confidence weighting in the same study.
The procedure for differential weighting has already been 
described. In the confidence weighting part, students were 
asked to distribute 100 points of degree of certainty among 
five response alternatives on a specially designed answer 
sheet. These subjects' confidence weights were then used
to obtain item scores via a logarithmic function, the version
which corresponds to the first logarithmic function used in
the present study. The authors reported a nonsignificant
improvement in validity (.72 as compared with .62) for
confidence weighting over conventional scoring, but decreased
reliability (.655 as compared with .711). No test of
differences between reliabilities was made.

Collet (1971) compared new scoring approaches with
the correction-for-guessing method. This study used two
experimental scoring methods, the differential weighting
method and the elimination method, which is one type of
confidence weighting.
The elimination method was the same as that used by Coombs
et al. (1956), and the differential weighting method was
similar to that used by Davis & Fifer (1959). Two 50-item
multiple-choice tests were obtained from two parallel forms
of the Henmon-Nelson Test of Mental Maturity (college level)
and were given to six 47-student groups of undergraduate
students. Criterion scores for validity were obtained from
the Washington Pre-College Test administered some 18 months
before the study. The results indicated that both reliability
and validity obtained by the elimination method were higher
than those obtained by the correction-for-guessing method.
However, the results were reversed with the differential
weighting method. Only the validity obtained by the elimination
method was significantly different from that obtained by the
correction-for-guessing method. The author's conclusion was thus in
favour of the elimination approach.


Personality Influences on Test Scores 


Personality traits have long been suggested as
possible factors influencing test scores under new scoring
approaches (Coombs et al., 1956; Michael, 1968). However,
there are only two studies reported in the literature that
deal directly with these influences. Hansen (1971) studied
the influence of variables other than knowledge on confidence
test scores, or probabilistic test scores, as he called them.
Personality factors in this study were Risk Taking, Test Anxiety,
and some others as measured by the F-scale developed by
Christie, Havel, & Seidenberg (1958). The author found
that the response style is related to certain aspects of
personality.

Echternacht, Boldt, & Sellman (1972) attempted
to evaluate the association of personality variables with
confidence testing in light of practice. The testing techniques
that they used were called the "Distribute 100 points"
and the "Pick-One" techniques. The first one required
students to respond with subjective probabilities to each
of the item alternatives, and the latter required students
to select the best alternative and then rate their confidence
in that choice on a five-point scale. Item scores were then
computed using a logarithmic function similar to that suggested
by Shuford et al. (1966). Subjects were 192 males in the U.S.
Air Force. A personality test battery was developed from
several well-known personality tests. The results showed
some significant correlations between test scores and
personality factors, but the evidence did not hold up under
replication. All of the findings led the authors to the
conclusion that

. . . the personality variables are not related to
confidence test scores when achievement, as measured
by the number of items correctly answered, is controlled,
and when sufficient practice with the system has been
employed . . . grave skepticism about the use of confidence
measures due to undue effects is probably not
justified. At least, such reservations are not warranted
any more for confidence measures than they are for traditional
multiple choice measures (Echternacht et al., 1972).


Theoretical Facet of the New Techniques 


In contrast to the many empirical studies on the 
assessment of partial knowledge, there are just a few dealing 
with theoretical aspects of these approaches. De Finetti 
(1965) introduced the use of Decision Theory into the pro- 
blem of testing, calling it the Personal-Probability Approach. 
He suggested six preliminary assumptions which constitute the 
underlying philosophy of the approach. They were: 


1. The scoring method and permitted modes of responding 
must be known to the subjects, the subjects fully 
understanding the implications in the face of 
uncertainty. 


2. The subjects must be keenly interested in scoring
high.


3. The subjects must be trained to understand the 
correspondence between their own belief and the 
numerical probabilities to which they were trans- 
lated. 


4. The total knowledge and belief of a given subject 
about a question and its alternatives must be 
expressed and fully represented by numerical 
probabilities he attached to each of the alter- 
natives. 


5. The scores using any scoring method can be divided 
so as to determine the partial information of a 
subject from his responses. 


6. The evaluation of this procedure should concern
how well the scoring method describes the subject's
belief and its value to him (Echternacht, 1972).

In addition to these assumptions, de Finetti
attempted to provide some rationale for behavior as he presented
and discussed various scoring schemes.

Stokes (1966), in his article suggesting a new 
testing technique called "Split-Response Technique," viewed 
teachers' advantages as incentives for the use of a new 
approach. He suggested no theoretical implications of the 
technique. In his view, the Split-Response Technique will 
give more information to teachers especially in the following 
ways: 

- Ambiguous or otherwise poor questions can be detected
by the frequency with which they give rise to split 
responses. 

- Particular alternatives which are split testify to 
student uncertainty and point to meaningful test
review items. 

- Split responses on non-ambiguous questions serve
as a barometer for ineffective teaching. 

- Student confidence can be estimated better by 
observing the degree of splitting relative to the 
class. 

- New freedoms in test design are possible, the teacher 
using questions with several correct alternatives 
which require splitting. 

- New insights into student personality are possible
(Stokes, 1966). 

Shuford et al. (1966) gave a well known discussion 
on "Admissible Probability Measurement Procedures." Their 
objective was to extract a larger portion of the available 
information from objective test items. This information, 
as they stated, was contained in the student's degree-of-belief
probabilities or personal probabilities concerning
the correctness of the various possible answers. To
measure these probabilities, they contended, a scoring
system must be devised so that any student could maximize
his expected score if, and only if, he honestly reported his
probabilities. Scoring systems that had this
property and were understood by students were termed admissible
probability measurement procedures. In their view,
most commonly used measurement procedures were not admissible.
The authors also introduced the concept of a scoring system 
with a reproducing property; a scoring system was reproducible 
when the personal probabilities possessed by the examinee 
were identical to the probabilities with which he responded.
They derived some necessary and sufficient conditions for 
the reproducibility of a test item with two possible alter- 
natives. They further showed the class of reproducible 
scoring systems to be virtually inexhaustible and demonstrated 
a method of construction. However, all scoring functions 
they suggested and proved reproducible, except a logarithmic 
function, depended on both probabilities assigned to correct 
and incorrect alternatives. Because the logarithmic function
is unbounded when the probability
assigned to the correct answer is zero, they suggested an
approximate solution, a truncated logarithmic function,
in which the value of -1 was given to log 0. This logarithmic
function was used in several studies (Echternacht et al., 
1972; Hambleton et al., 1970; Rippey, 1968, 1970). 
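
The reproducing property described above can be illustrated numerically for a two-alternative item. The sketch below is an illustration under stated assumptions, not the formulation of Shuford et al. (1966): the function names are hypothetical, the truncation simply substitutes -1 for log 0 as described in the text, and the expected score is searched over an interior grid of reported probabilities (0.01 to 0.99) so that the truncation never comes into play.

```python
import math

def truncated_log_score(p_correct, log_zero=-1.0):
    """Common logarithm of the probability assigned to the correct answer,
    with the unbounded value log 0 replaced by log_zero (an assumption
    following the truncation described in the text)."""
    if p_correct <= 0.0:
        return log_zero
    return math.log10(p_correct)

def expected_score(true_p, reported_p):
    """Expected logarithmic score on a two-alternative item when the examinee
    believes alternative 1 is correct with probability true_p but reports
    reported_p (both strictly between 0 and 1)."""
    return (true_p * math.log10(reported_p)
            + (1.0 - true_p) * math.log10(1.0 - reported_p))

def best_report(true_p, steps=99):
    """Reported probability on the grid 0.01, 0.02, ..., 0.99 that maximizes
    the expected score for an examinee whose belief is true_p."""
    grid = [k / (steps + 1) for k in range(1, steps + 1)]
    return max(grid, key=lambda r: expected_score(true_p, r))
```

For any belief lying on the grid, `best_report` returns that same belief, which is the sense in which the logarithmic function rewards honest reporting.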

Lord & Novick (1968) devoted an entire chapter
to the problem of measurement procedures and item scoring
formulas. They began by stating that the general problem of
obtaining the maximum amount of information from a given
set of items contains three major components. The first is
the measurement procedure, or the manner in which the examinees
respond to the item. The second is the specification of the
item scoring rule or formula that is used for scoring each
item. The third is the combination of item scores into a
total score by an item weighting formula. The first two
components are concerned directly with the problem of item
scoring, and the third one examines the problem of item
weighting, which is not being considered here. When dealing
with the problem of choosing either simple or more complex
measurement procedures, they suggested that...


. . . it may be that examinees are available for a
relatively long period for testing, but that test items 
are very difficult to obtain. Then we would want to 
obtain as much information as possible from each item, 
and hence we would be tempted to employ more complicated 
measurement procedures, if they were indeed useful. ... 
It may be that items are plentiful but examinee time is 
scarce. It may also be reasonable to assume that per 
unit of time, we can probably get more information by 
adding more items (if available) than by introducing 
complex measurement procedures. Then we would probably 
be inclined to use the simpler measurement procedure so 
that we might administer as many items as possible in 
the limited amount of time (Lord & Novick, 1968).


After a review of some possible scoring approaches,
they advised that...


. . . what little experimental work has been done on
the traditional methods of formula scoring has not been
encouraging, and that no experimental work has been published
that supports the new methods. Thus, at present,
the sole recommendation of these new methods is their
strong conceptual attractiveness. In evaluating any
new response method, it will be necessary to show that
it adds more relevant ability variation to the system
than error variation, and that any such relative
increase in information retrieved is worth the effort
. . . (Lord & Novick, 1968, p. 314).

However, the authors' preferred method seems to be
that of the Personal-Probability Technique. They stated
that

. . . the assumptions of the personal probability model are
certainly more realistic than the assumption of the
random guessing model. Many of the questions raised
. . . may well be answered satisfactorily by empirical
studies (Lord & Novick, 1968, p. 320).

Conclusion 

This chapter has reviewed the literature on the
problem of assessing partial knowledge, both empirical and
theoretical. The literature indicates that partial knowledge
can be measured and describes some promising
approaches. It also shows that further studies are needed.
It is hoped that the present study will make at least a
partial contribution to the clarification of the problems
under consideration.

CHAPTER III
RESEARCH METHODOLOGY 


Design of the Study 

The present study employed three randomly assigned 
groups: Conventional Testing, Confidence Testing, and 
Elimination Testing. Subjects were 1028 grade nine students, 
of both sexes, selected from eight schools in Edmonton, 
Alberta and vicinity, during the first term of the 1972-73 
school year. The testing instruments were two forms, A and
B, of vocabulary and mathematics aptitude tests.

There were two testing times; the first was in 
October and the second in November and December 1972, with 
approximately three to five weeks between sessions for each 
class. At both testing times, students in three groups from 
the same classroom sat to write tests in the same room. They 
were given two form-A tests, vocabulary and mathematics 
aptitude tests, at two successive times, with one of 
three different test-taking instructions: Conventional 
Testing, Confidence Testing, and Elimination Testing, 
depending on which group they were assigned to. 

In addition to form-A tests, form-B tests were given 
to students at the second test session after the first form in 
the same sequence: vocabulary and then mathematics aptitude, 
using Conventional Testing for all students. 

Students' answers to form-A tests, using several
scoring methods, provided data for test reliability. Answers 
to form-B tests, using the Conventional Scoring method, pro- 
vided data for test validity. 

At the end of the first term (three to five weeks 
after the second test session), school final scores from three 
subject areas: Language Arts, Mathematics, and Science, 
together with an Academic Average, were obtained from school 
files. These scores, after standardization, were used for 
test validity with respect to students' achievement in schools. 

Details regarding subjects, tests, test administration
procedures, scoring techniques, and school achievement
are presented in subsequent sections.


Subjects 


Early in September 1972, five school systems in 
Edmonton and vicinity were requested to take part in the 
present study. Eight schools were willing to participate 
in the project. Two test sessions were then held between
October and December 1972, with a 3-5 week interval
between sessions for each class. The total sample, after 
excluding those whose scores were not complete, consisted of 
1028 students. 

Before the first test was taken, students in each
class were randomly assigned to three groups of comparable size:
the Conventional Testing, the Confidence Testing,
and the Elimination Testing groups. The group division was
made in each class for the purpose of eliminating class
biases. However, during testing, all students sat at their
usual places regardless of the group to which they belonged.
The only difference among the three groups was that they were
given different test-taking instructions.

No effort was made to equate these groups and also 
no analysis was made to prove that they were equivalent with 
respect to any external or internal criterion. Each group 
was assumed to be randomly drawn from the population under 
consideration. 

Table 1 shows the size of each group and some other
details.


Instruments 

Four tests were used in this study: Vocabulary Test,
forms A and B, and Mathematics Aptitude Test, forms A and B.
All items in these four tests were compiled from the Kit of 
Reference Tests for Cognitive Factors--Revised Edition-- 
(French, Ekstrom, & Price, 1963). The following are 
details of each test: 


Vocabulary Test: Form A (VA). 5 choices, 25 items
(12 minutes for all groups).

Example: jovial
1 - refreshing
2 - scare
3 - thickset
4 - wise
5 - jolly


Items in this test are the same as the first 25 items
in Vocabulary Test--V-2 in the Kit and are also in the same
sequence. This test measures verbal comprehension. The
test was used under all three test-taking instructions and
answered on three different answer sheets. One of the
three sets of instructions was printed on the front cover of each test
booklet. Time limits were equal for all groups and set longer
than the original test to allow enough time for students to
finish their answers.

Vocabulary Test: Form B (VB). 4 choices, 30 items
(8 minutes for all groups).

Example: attempt
1 - run
2 - hate
3 - try
4 - stop


Items in this test are the same as the first 30 items 
in Vocabulary Test--V-1, which is a parallel form of form V-2, 
in the Kit, and also in the same sequence. This test was 
used under the Conventional Testing method for all three 
groups using IBM optical answer sheets. Conventional test 
instructions were also printed on the front cover of the test 
booklets. Time limits were set equal to the original form, 
although the number of items was five less, to allow enough 
time for students to finish their answers. 

Mathematics Aptitude Test: Form A and B (RA-RB).
5 choices, 15 items each (15 minutes for RA and 10 minutes
for RB).

Example: How many pencils can you buy for 50
cents at the rate of 2 for 5 cents?

Items in both tests were randomly assigned from all
30 items in Mathematics Aptitude Test--R-1 in the Kit, which
measures numerical reasoning. Form A was used under three 
different test-taking instructions and answered on the same 
answer sheets as used for VA test. Form B was used under the 
Conventional Testing method and answered on the same IBM 
answer sheet as VB test. Specific test instructions for 
each method were also printed on the front cover of the test 
booklets. The time limit was set longer for RA to allow 
students enough time to finish answers under the experimental 
methods, but for RB the time limit was the same as half of 
the original test. 

Answer Sheets. There were three different formats 
for the answer sheets, one for each test-taking method, for 
VA and RA tests. Answers under Conventional Testing were recorded on
an IBM answer sheet and could be scored by machine. The
other two forms, especially designed for the students'
convenience in answering, had to be scored by hand. Answers
to VB and RB tests were also on IBM answer sheets and thus 
scored by machine. It is not impossible, however, to develop 
answer sheets for these experimental methods for machine 
scoring but this work should be considered only if these 
methods proved worth the effort. 


Test instructions and answer sheets
are given in Appendix A.

Preliminary Test for Time Limit. Before the first 
test was taken, a class of 28 grade nine students in F. R. 
Haythorne Junior High School were given the VA and RA tests 
under the two experimental test-taking methods, CF and EL. 
The purpose of this test was to try out the test-taking instructions
and set the time limits. After the testing, the oral instructions
were revised and time limits were set for each test.
Students used for this pilot testing were not included in
the main study.


Test-Taking Instructions 


There were two types of instructions for VA and RA 
tests. The first, given orally by the proctors, consisted of 
general information for all three groups, and covered the way 
test booklets were distributed, and what students had to do 
during the testing time. Approximately five minutes were
used for this.

The second part was the main and specific instruc- 
tions for each test-taking method and was printed on the 
front cover of the test booklets; each booklet had only one 
set of instructions. These instructions described the method 
of answering and included examples. They also encouraged the 
student to ask questions if he was unsure of the correct 
procedure. 

The VB and RB tests had only printed instructions,
since this method was already known to all students.

Both types of test-taking instructions are given
in Appendix A.


Test Administration Procedures 

For both testing times and all 44 classes in this 
study, a group of graduate and undergraduate students in the 
Faculty of Education, University of Alberta, were used as 
test proctors. Most were experienced teachers. Information 
and details about the project, and the precise procedures to 
use were given to them before testing. Each proctor was 
given the "Instructions to Examiners" and "Instructions for 
Students" when he went testing, and each was responsible for 
one class at a time. The first instructions were an outline 
of administration procedures, and the second were the oral 
instructions to read to the class before testing. 

The first testing time began with a distribution of
VA test booklets, which were pre-arranged in an alternating
order: CV, CF and EL test booklets, accompanied by answer
sheets. This pre-arranged sequence of test booklets was used
for the purpose of random assignment of students into groups.
Thus, when the first test was given, all students were
grouped randomly, regardless of their sex or any other
criterion. This procedure was convenient and successful in
terms of randomization. The proctor then read the general
instructions to the class and asked the students to go on
reading the specific instructions on their test booklets. A
question period was given, followed by students writing
the test. The second test, the RA test, was given when the
first test was finished. Students received test booklets
under the same test-taking instructions. No general instructions
were given at this time, and the students were asked to
read the specific instructions on the test booklets and
start writing the test after another question period. The
RA test was answered on the same answer sheets as used with
the VA test.
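
The booklet-rotation assignment described above can be sketched as follows. This is only an illustration of the idea: the function name and the student identifiers are hypothetical, and the sketch assumes booklets are handed out in seating order, one per student, in the fixed rotation CV, CF, EL.

```python
from itertools import cycle

def assign_by_booklet_rotation(students, methods=("CV", "CF", "EL")):
    """Hand out booklets in a fixed rotating order; each student's group is
    the test-taking method printed on the booklet received."""
    return {student: method for student, method in zip(students, cycle(methods))}

# Hypothetical class of seven students, listed in seating order.
groups = assign_by_booklet_rotation(["s1", "s2", "s3", "s4", "s5", "s6", "s7"])
```

Because the rotation restarts only after all three methods have been issued, group sizes within a class can differ by at most one student.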

After the first test session, students' names in each
class were recorded according to their groups on separate
sheets. These lists were given to proctors at the second
testing time. There were two parts at this test session.
The first, for VA and RA tests, was carried out in the same
manner as that of the first session, except that, at this
time, test booklets were distributed to students according
to groups and names on the lists. The VB and RB tests were
given in the second part at two successive times, and were
answered on the same answer sheets. These two tests were
answered under the conventional method, so there were no
special instructions. Students were asked to read the
specific instructions on test booklets and start writing
the test after a question period.


A Technique for Differential Weighting of 
Item Options 


The differential weighting procedure used in this
study was a simple one compared with those reported in the
literature. Hambleton et al. (1970) used two different
procedures; one resulted from a rather complex manipulation.
The other, which was similar to the technique in this
study, was simpler. The second procedure, in their study,
started with 22 experts rank-ordering the
five item options for correctness. The average rank among these experts was
then obtained for each alternative, including the correct
answer. The option with the second-best average rank was
weighted three, and so on down to the option ranked
last, which was weighted zero. This procedure resulted in a
discrete weighting system for all item options from four, the
correct answer, to zero, the least correct distractor, in
equal steps of one.

The procedure used in the present study started with 
44 raters, a group of undergraduate and graduate students 
taking courses in Educational Measurement at the University 
of Alberta during the 1972-73 winter session. These students 
were asked to rank order the four incorrect options for their
degree of correctness, in both VA and RA test items. These
ranks were then converted into weights, giving four to the
highest rank and so on down to one for the lowest rank. The average
weights among all 44 raters for each incorrect option were
then obtained, with the weight of five assigned to the correct
answer. These weights were used to score answers under the
conventional test-taking method. Option weights obtained by
this technique were not discrete and varied both among options
in the same item and across items. This system of weighting
seems more reasonable than the second one used by Hambleton
et al. (1970), since the differences between option weights
reflect their varied degree of correctness. This was the
assumption underlying the differential weighting approach.
The reliability of rating by the analysis of variance method,
using unadjusted reliabilities (Winer, 1970, p. 283), was .895
for VA and .906 for RA tests, suggesting a very high degree
of agreement among raters.
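
The conversion of raters' ranks into option weights can be sketched in Python. This is a minimal illustration under stated assumptions: the function name and data layout are hypothetical, and rank 1 is taken to mean the distractor judged most nearly correct (the "highest" rank, converted to a weight of four).

```python
def average_option_weights(rater_ranks, correct_option,
                           n_distractors=4, correct_weight=5.0):
    """Average option weights for one item.

    rater_ranks    -- list of dicts, one per rater, mapping each incorrect
                      option to its rank (1 = most nearly correct)
    correct_option -- label of the keyed answer, which receives correct_weight
    """
    totals = {}
    for ranks in rater_ranks:
        for option, rank in ranks.items():
            weight = n_distractors + 1 - rank      # rank 1 -> 4, rank 4 -> 1
            totals[option] = totals.get(option, 0.0) + weight
    weights = {opt: total / len(rater_ranks) for opt, total in totals.items()}
    weights[correct_option] = correct_weight
    return weights

# Two hypothetical raters ranking distractors A-D of an item keyed E.
example = average_option_weights(
    [{"A": 1, "B": 2, "C": 3, "D": 4},
     {"A": 2, "B": 1, "C": 3, "D": 4}],
    correct_option="E",
)
```

Because the weights are averages over raters, they need not be whole numbers, which is why the resulting option weights are not discrete.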

Tables of raters by classes, and option weights for
each test, VA and RA, are given in Appendix A.


Scoring Techniques 


Four different scoring techniques were used in this 
study. They were applied as follows: 

(1) Conventional Scoring (CVS Method). This scoring
technique was used to score answers from VA and RA tests
under the Conventional Testing and the Elimination Testing,
and also all answers from VB and RB tests. Under the Conventional
Testing, a correct answer was scored one; any other response,
including omission, was scored zero. Under the Elimination
Testing, an answer with all incorrect alternatives crossed
out was scored one; any other response, including omission,
was scored zero. The total score was a sum of item scores and no
guessing formula was used.

(2) Differential Weighted Scoring (DWS Method). This
scoring technique was used to score answers from VA and RA
tests under the Conventional Testing. Each incorrect option
had a predetermined weight obtained from the technique of
differential weighting described above, with the correct
answer having a weight of five. An item score was the weight
of the option selected, and the total score was a sum of
item scores.

(3) Confidence Weighted Scoring (CWS Method). This
scoring technique was used to score answers from VA and RA
tests under the Confidence Testing. There were five scoring
functions used under this scoring technique. In all cases,
the total score was a sum of item scores. The following are
these five scoring functions:

where: $S_{ij}$ is the item score for the ith person and
       jth item.

       $r_{ij}$ is the degree of confidence (in points) a
       student assigns to the correct answer
       (possible values run from 0 to 10).

       log is the common logarithm (base 10).

       $z_{ij}$ is a normalized standard score of $r_{ij}$ on
       the distribution of 11 possible values of
       $r_{ij}$ (see Table 2 for the method of normalization).

(4) Elimination Scoring (ELS Method). This scoring
technique was used to score answers from VA and RA tests under
the Elimination Testing. Each incorrect option crossed
out was scored one, and the correct answer, if crossed out,
was scored -4. Omissions were scored zero. An item score
was the algebraic sum of the scores of the options crossed
out.


TABLE 2


NORMALIZED STANDARD SCORE OF r_ij ON THE DISTRIBUTION
OF 11 POSSIBLE VALUES OF r_ij


Values were read from the table of normal distribution.
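
A minimal Python sketch of the ELS rule follows. The scores of +1 and -4 are those stated above; the five-option item and the option labels are assumptions made only for the illustration.

```python
# Elimination Scoring (ELS): +1 for each incorrect option crossed out,
# -4 if the correct option is crossed out, 0 for an omitted item.
# Five options per item is an ASSUMPTION used only for this illustration.

def els_item_score(crossed_out, correct):
    """Score one item from the set of options the student crossed out."""
    score = 0
    for option in crossed_out:
        score += -4 if option == correct else 1
    return score


options = {"A", "B", "C", "D", "E"}
correct = "B"

full_knowledge = els_item_score(options - {correct}, correct)  # four distractors
partial = els_item_score({"A", "D"}, correct)                  # two distractors
misinformed = els_item_score({"B", "C"}, correct)              # key crossed out
omitted = els_item_score(set(), correct)                       # nothing crossed out

print(full_knowledge, partial, misinformed, omitted)  # 4 2 -3 0
```

Under this assumption of five options, crossing out every option yields zero, the same score as an omission, so indiscriminate elimination earns nothing.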


Problems With the Confidence Scoring Functions 

Some discussion of the scoring functions used in
the Confidence Weighted Scoring method is needed to clarify
their characteristics. Two important points were considered
when using these functions in the study: the problem of the
value to be assigned to log 0, and the various possible
values of item scores from different scoring functions. The
problem of the log 0 value is discussed first, followed
by the problem of item scores.

A Pilot Analysis on the Value of log 0. When Shuford
et al. (1966) introduced the reproducible scoring system into
the Confidence Testing approach, the logarithmic scoring
function, they contended, was the only reproducible scoring
function that depended solely on the probability assigned
to the correct answer. The function, however, had one
difficulty--the unbounded property of log 0 value. They
suggested an approximate solution, a truncated logarithmic
function setting the value of log 0 at -1. This function
was used in studies by Hambleton et al. (1970), and Rippey
(1968, 1970), with some minor changes in the form of the
function.


In this study, there were two logarithmic functions,
one of which was a modified form of that suggested and used
in the literature but the other was original. Since the
problem of log 0 value could not be avoided, and its value
was likely to affect the distribution of test scores and
might result in different outcomes for the study, it was
contended that the problem should be investigated and
determined before any further analysis. A pilot analysis was thus
undertaken with answers from VA and RA tests obtained under
the Confidence Testing to examine the test reliability. The
second scoring function was used in this analysis with three
different values of log 0: -1, -.30, and 0. The different
sets of possible item scores are shown in Table 3. These
score sets differed only on the first value. Reliabilities
of test scores (alpha coefficient) from the three sets of
possible answers were obtained by the analysis of variance
method (Winer, 1970, p. 289). Results of this analysis are
shown in Table 4.
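
The three score sets compared in the pilot analysis can be generated with a few lines of Python. This is only a sketch: it assumes that the logarithmic function assigns the common logarithm of the confidence points for 1 to 10 points, and that the undefined log 0 is replaced by the chosen constant. The function name and the rounding to two decimals are illustrative choices.

```python
import math


def log_score_set(log0_value, max_points=10):
    """Possible item scores for confidence points 0..max_points.

    ASSUMPTION: the score for r points is log10(r) for r >= 1, and the
    undefined log10(0) is replaced by `log0_value`.
    """
    scores = [log0_value]
    scores += [round(math.log10(r), 2) for r in range(1, max_points + 1)]
    return scores


# The three candidate values of log 0 examined in the pilot analysis.
score_sets = {value: log_score_set(value) for value in (-1.0, -0.30, 0.0)}

for log0, scores in score_sets.items():
    print(log0, scores)
```

Printing the sets side by side makes it plain that they differ only in the first entry, the score given when no confidence is placed on the correct answer.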

The alpha coefficients obtained from the use of
three different values of log 0 (Table 4) showed clearly the
effect of this problem on test scores. In all cases, the
results were consistently in favour of the 0 value. The
value of -1, however, gave consistently higher alpha values
than those obtained using the value of -.30, except for the
second RA test. It is likely that the first possible score
set, tying the value of log 0 with log 1, was the best one
among them. It was also assumed that the same results would
be obtained if applied to the third scoring function.
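TABLE 3


THREE SETS OF POSSIBLE ITEM SCORES FROM
DIFFERENT VALUES OF log 0


Value of
log 0      Possible item scores (lowest to highest)

  0          0     0   .30  .48  .60  .70  .78  .85  .90  .95  1.00
 -.30      -.30    0   .30  .48  .60  .70  .78  .85  .90  .95  1.00
 -1       -1.00    0   .30  .48  .60  .70  .78  .85  .90  .95  1.00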
TABLE 4


RELIABILITIES (ALPHA COEFFICIENTS) OF TEST SCORES OBTAINED
FROM THE USE OF DIFFERENT VALUES OF log 0


As a result of this pilot study, the value of 0 was given 
to log 0 in both logarithmic functions for use in the main 
analysis. 

Possible Item Scores from Different Scoring Functions.
Scoring functions under Confidence Testing, as shown in the
previous section, present one interesting characteristic and,
hence, deserve some discussion here. Table 5 shows various
possible item scores obtained from their use, and
Figure 1 is the plot of these scores (x) against their ranks
(n). By considering the plotted lines, it is apparent that
each function has rates of increase between two successive
values that differ from the others. Among these functions,
only the first one appears as a straight line; all others
are curves of different forms. This evidence suggests
that these functions do not give the same increment between
successive possible scores, either within or across functions.
To see their patterns more clearly, Table 6 was constructed
showing differences between two successive possible scores
(d_n) obtained under each function. These values (d_n) were
then plotted against their ranks (n) as shown in Figure 2.
Figure 2 confirms what was noted above: each
scoring function has its typical pattern of possible score
increments. Function 1 has a special pattern different from
the others, with all score increments equal, resulting in a
straight line parallel to the horizontal axis. The other
functions have different regular curves. It is noted that the
irregularities appearing at the beginning of the second line
and at the end of the third line in Figure 2 are results of
giving the value of 0 to log 0. These irregularities will not
be considered here.


TABLE 5


POSSIBLE ITEM SCORES FROM FIVE SCORING FUNCTIONS


TABLE 6


DIFFERENCES BETWEEN TWO SUCCESSIVE POSSIBLE
ITEM SCORES FROM FIVE SCORING FUNCTIONS

The differences among the possible score increments 
resulting from the use of these functions have significant 
meaning to test scores and may affect test characteristics. 
More consideration is needed to understand the underlying 
meaning and make use of their merits in appropriate ways. 
The following is a summary of findings about these functions 
read from Figure 2. 

Function 1. This function makes use of the points 
of confidence assigned to the correct answers without any 
change. All possible scores are then linearly related to 
the given confidence levels. When possible points of con- 
fidence are in the form of natural numbers from 0 to 10, the 
increments between successive values are constant, i.e., all 
equal to one. In terms of assigning scores according to 
personal confidence, it means that this scoring function 
gives an equal increment of reward to equal increments of 
confidence level, no matter how high or low the confidence 
level is. 

Function 2. This function, and all the following,
changes possible points of confidence into another series of
values. This conventional logarithmic function gives a
decreasing increment pattern; the higher the point of
confidence is, the smaller is the score increment. In terms of
assigning scores according to personal confidence, this
scoring function gives a high increment of reward to a low
confidence level, and gives a low increment of reward to a
higher confidence level. The score increment varies
inversely with the confidence level.

Function 3. Functions 3 to 5 were original to this
study. Function 3 is logarithmic and was purposely designed
to be complementary to the conventional logarithmic one in 
terms of possible score increments. It gives an increasing 
increment pattern; the higher the point of confidence is, 
the larger is the score increment. This function gives more 
reward to high levels of confidence. The score increment 
varies directly with personal confidence in the correct 
answer. 

Function 4. This quadratic scoring function behaves 
in somewhat the same manner as the third one. The difference 
between them is the pattern of increment. The score increment 
in this function increases faster than that in the former one. 

Function 5. This function is likely to give a
compromise between functions 2 and 3, the two logarithmic
functions. It gives both decreasing and increasing increments.
At the lower part of the confidence distribution, the incre- 
ment decreases when the confidence level increases, and then 
reverses at the higher part, with the turning point at the 
median. This function gives more increment of reward to 
confidence levels at both extremes. 

Consideration of these scoring functions leads one
to realize that they represent four distinctive scoring
models based on score increments as their main
characteristic. Since all scoring functions used in this study with
the confidence-weighting test-taking approach depend entirely
on the points of confidence assigned to the correct option,
these score increment models are also dependent upon this
limitation. The following is a presentation of these models
in mathematical form:

Let  x_1, x_2, x_3, ..., x_n      be possible item scores from
                                  the lowest to the highest
                                  values,

     y_1, y_2, y_3, ..., y_n      be equally spaced points of
                                  confidence assigned to the
                                  correct answer, from the lowest
                                  to the highest values, and

     d_1, d_2, d_3, ..., d_(n-1)  be the difference between two
                                  successive possible item scores.


Then we have the following four models:


Model 1: Constant Increment Model

     x_n = f(y_n)

and

     d_n = d_(n+1)

This model gives equal distance between two successive
possible item scores when the points of confidence increase
equally. Function 1 is one scoring function based on this
model.


Model 2: Decreasing Increment Model

     x_n = f(y_n)

and

     d_n > d_(n+1)

This model gives decreasing distance between two
successive possible item scores when the points of confidence
increase equally. Function 2, the conventional logarithmic
function, is one form based on this model.

Model 3: Increasing Increment Model

     x_n = f(y_n)

and

     d_n < d_(n+1)

This model is complementary to the second one, and
gives increasing distance between two successive possible
scores when the points of confidence increase equally.
Functions 3 and 4 are among possible forms based on this
model.

Model 4: Normalized Increment Model

     x_n = f(y_n)

and

     x_n ~ N(0, 1)

This model gives a normal distribution of possible
item scores when the points of confidence increase equally.
Function 5 represents the use of this model.
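
The increment patterns that separate the first three models can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the quadratic example is simply r squared rather than the study's Function 4, and the point 0 is left out so that log 0 does not arise.

```python
import math


def increments(scores):
    """Differences d_n between successive possible item scores."""
    return [b - a for a, b in zip(scores, scores[1:])]


def increment_pattern(scores, tol=1e-9):
    """Classify a score series by how its successive increments change."""
    d = increments(scores)
    steps = [b - a for a, b in zip(d, d[1:])]
    if all(abs(s) <= tol for s in steps):
        return "constant"
    if all(s < -tol for s in steps):
        return "decreasing"
    if all(s > tol for s in steps):
        return "increasing"
    return "mixed"


points = list(range(1, 11))                     # equally spaced confidence points
linear = [float(r) for r in points]             # scores equal to the points
logarithmic = [math.log10(r) for r in points]   # common logarithm of the points
quadratic = [r ** 2 for r in points]            # ILLUSTRATIVE quadratic form

for name, series in [("linear", linear),
                     ("logarithmic", logarithmic),
                     ("quadratic", quadratic)]:
    print(name, increment_pattern(series))
```

A series whose increments first decrease and then increase, as described for Function 5, would be reported as "mixed" by this classifier.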

An innovation in this study is the analysis of the 
theoretical models underlying the scoring functions. The 
first two models underlie scoring functions used in previous 
studies. Models 3 and 4 and the scoring functions based on 
them are unique to this study. 


The analysis of the models was undertaken on the
assumption that the different scoring functions would
affect test scores in different ways, particularly their
reliabilities and validities. These latter factors are the
main concern of this study.


School Achievement 

School achievement scores used in this study were 
obtained from three subject areas: Language Arts, Mathematics, 
and Science. An academic average was also computed based upon 
four subjects including the above three and Social Science. 
Marks from schools were obtained in three different forms: 
letter-grades, stanine-grades, and raw scores. Since marks 
from the same subjects had to be combined into the same set, 
each was standardized using the mean and standard deviation 
of each school group for conversion. A frequency distribution 
for each school group and each subject was constructed before 
standardization. The letter-grade system needed special 
treatment to fit into this table. These grades were first 
changed into numbers and the numbers were then used to 
construct the frequency distribution. Table 7 shows the 
numbers assigned to letter-grades before further treatments. 

Four sets of standard scores were then obtained by
the procedure described above within each group; thus there
were 12 sets of school achievement criteria. These scores
were used in the main analysis to provide test validity.
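
The conversion of letter-grades to standard scores can be sketched as follows. The grade-to-number mapping follows Table 7 (A = 11 down to F = 0); the sample marks are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say whether n or n - 1 was used.

```python
from statistics import mean, pstdev

# Numbers assigned to letter-grades, following Table 7.
LETTER_TO_NUMBER = {
    "A": 11, "A-": 10, "B+": 9, "B": 8, "B-": 7, "C+": 6,
    "C": 5, "C-": 4, "D+": 3, "D": 2, "D-": 1, "F": 0,
}


def standardize(marks):
    """Convert marks to standard scores using the group's own mean and SD.

    ASSUMPTION: the population SD (divisor n) is used.
    """
    m = mean(marks)
    s = pstdev(marks)
    return [(x - m) / s for x in marks]


# Hypothetical letter-grades for one subject in one school group.
letter_marks = ["A", "B+", "C", "C-", "B"]
numeric_marks = [LETTER_TO_NUMBER[g] for g in letter_marks]
standard_scores = standardize(numeric_marks)
print([round(z, 2) for z in standard_scores])
```

Standardizing within each school group puts letter-grades, stanine-grades, and raw scores on a common scale before they are combined.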


TABLE 7


NUMBERS ASSIGNED TO LETTER-GRADES FOR STANDARDIZING


Grade     Number
A           11
A-          10
B+           9
B            8
B-           7
C+           6
C            5
C-           4
D+           3
D            2
D-           1
F            0


Data Obtained to Test Hypotheses


The design and procedures described in previous
sections were carried out to obtain the following data:

The First Test Session. Eighteen sets of data were
gathered from this session, nine for each of the VA and RA
tests. They were:

(1) Two sets of scores from each test under the
Conventional Testing, one by the Conventional Scoring and the
other by the Differential Weighted Scoring.

(2) Five sets of scores from each test under the
Confidence Testing using the five scoring functions.

(3) Two sets of scores from each test under the
Elimination Testing, one by the Conventional Scoring and the
other by the Elimination Scoring.

The Second Test Session. Twenty-four sets of data
were gathered at this time. Eighteen of them were of the same
kinds as those obtained from the first session, and six other
sets were obtained from VB and RB tests, two from each group.
Scores from VB and RB tests, though tested and scored under the
same methods, were recorded separately by group for the
purpose of analysis within group. Thus six score sets were
gathered rather than two.

Achievement Criteria. Twelve sets of achievement
scores were obtained and recorded in standard score form by
the procedure previously described. There were four sets
for each group.


Summary of Testing Design and Data Sets 

To give an overall view of the testing design and
data sets gathered in this study, Tables 8 and 9 were
constructed. Table 8 shows the testing and scoring methods for
VA and RA tests at both test sessions. As reported previously,
there were initially three test-taking methods and four
scoring methods used in the study. Table 9 shows all data
sets included in the analysis, classified by tests, testing
times, and groups. Achievement scores recorded from schools
were obtained by conventional examinations so no scoring
method was specified.


TABLE 8


TESTING DESIGN FOR VA AND RA TESTS AT
BOTH TESTING TIMES


Statistical Hypotheses


The main purpose of this study was to examine test
reliability and validity as indices of the effectiveness of
different testing and scoring methods. The following
hypotheses were tested:

(1) The coefficients of internal consistency (alpha
coefficient) from test scores under the experimental
test-taking and scoring methods should be consistently higher than
those obtained under the conventional methods.

(2) The coefficients of stability (test-retest
reliability coefficient) obtained from test scores under the
experimental methods should be consistently higher than
those obtained under the conventional methods.

(3) Test scores obtained under the experimental methods
should be more valid, with respect to external criteria, than
those obtained under the conventional methods.


TABLE 9


A SUMMARY OF DATA SETS IN THE STUDY


1st Session
2nd Session
School Achievement: Language Arts, Mathematics, Science,
Academic Average
Total

* Signifies data obtained from schools.


Statistical Techniques


Statistical analyses used in this study concerned
the following computations:

(1) One-way analysis of variance with repeated
measures to obtain internal consistency reliability (alpha
coefficient) of test scores, i.e., adjusted reliability
(Winer, 1970, p. 289), and reliability of ratings, i.e.,
unadjusted reliability (Winer, 1970, p. 283).

(2) Product-moment correlations, for test-retest
reliability and validity.

(3) Tests of differences between ρ1 and ρ2 using
independent samples, for validities and test-retest reliabilities
between groups (Glass & Stanley, 1970, p. 311).

(4) Tests of differences between ρ12 and ρ13 using
dependent samples, for validities in the same group (Glass &
Stanley, 1970, p. 313; and Olkin & Siotani, 1964).

(5) Tests of differences between ρ12 and ρ34 using
dependent samples, for test-retest reliabilities in the same
group (Olkin, 1967).

(6) Tests of differences between alpha coefficients
using independent samples, for alpha coefficients between
groups (Feldt, 1969).
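
The analysis of variance route to reliability in item (1) can be sketched in Python. The sketch assumes the usual formulation, in which the adjusted reliability is one minus the ratio of the residual mean square to the between-persons mean square (this equals coefficient alpha), and the unadjusted reliability uses the within-persons mean square instead. The function names and the small data set are hypothetical; the computational layout in Winer is not reproduced here.

```python
from statistics import variance


def anova_reliabilities(scores):
    """Return (adjusted, unadjusted) reliability from a persons x items table.

    scores: list of rows, one row of k item (or rater) scores per person.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    person_means = [sum(row) / k for row in scores]
    item_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_persons = k * sum((m - grand) ** 2 for m in person_means)
    ss_items = n * sum((m - grand) ** 2 for m in item_means)
    ss_within = ss_total - ss_persons        # variation within persons
    ss_residual = ss_within - ss_items       # item (or rater) effect removed

    ms_persons = ss_persons / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    ms_residual = ss_residual / ((n - 1) * (k - 1))

    adjusted = 1 - ms_residual / ms_persons  # equals coefficient alpha
    unadjusted = 1 - ms_within / ms_persons
    return adjusted, unadjusted


def cronbach_alpha(scores):
    """Coefficient alpha from item variances, used here as a cross-check."""
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical 0/1 item scores for five persons on four items.
data = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
adjusted, unadjusted = anova_reliabilities(data)
print(round(adjusted, 4), round(unadjusted, 4), round(cronbach_alpha(data), 4))
```

Because the computation needs only sums of squares, it applies equally to weighted item scores, which is why one procedure can serve all the scoring techniques.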
CHAPTER IV


RESULTS 


The contents of this chapter are grouped in two 
main sections. The first section deals with the reliability 
of test scores. Two types of reliability were obtained in 
this study: internal consistency in terms of the alpha 
coefficient, and test-retest reliability. These two cases 
will be treated separately since their characteristics are 
rather distinctive. The second section deals with the 
validity of test scores. Two types of criterion scores 
were used: school achievement based on Language Arts, 
Mathematics, and Science; and aptitude test scores from 
vocabulary and mathematics aptitude tests. The first part 
of the validity section deals with validities of test scores
under different scoring methods when school achievement is 
the criterion. The second part examines validities with 
respect to aptitude scores. Finally, there is a discussion 
of the entire validity problem. 

In all cases, the results within each group were
examined first and then a comparison was made across
groups among values chosen as the best from the first stage,
including those from the conventional method. The test
scores and results obtained from the initial test session
were used as major indicators of the effectiveness of the
experimental techniques. Data from the second test
session were used mainly for test-retest reliability and
as supporting evidence for the initial results.

The following symbols are used in all tables in
this chapter:

VA      = Vocabulary Test, Form A
RA      = Mathematics Aptitude Test, Form A
VB      = Vocabulary Test, Form B
RB      = Mathematics Aptitude Test, Form B
CV      = Conventional Testing Method
CF      = Confidence Testing Method
EL      = Elimination Testing Method
CVS1    = Conventional Scoring Technique under the
          Conventional Testing Method
CVS2    = Conventional Scoring Technique under the
          Elimination Testing Method
DWS     = Differential Weighted Scoring Technique
          under the Conventional Testing Method
CWS1-5  = Confidence Weighted Scoring Technique,
          functions 1 to 5, under the Confidence
          Testing Method
ELS     = Elimination Scoring Technique under the
          Elimination Testing Method


B 


Language Arts 


Mathematics 


5 


SC = Science 


AV = Academic Average 
Reliability


Coefficient of Internal Consistency 

Coefficients of internal consistency, i.e., alpha
coefficients, were obtained by the analysis of variance method
(Winer, 1970, p. 289) since this procedure could be applied
to all of the different scoring techniques used in this study.
Because there was no procedure available to test the
significance of a difference between two alpha values within the
same group, the comparison among them was made on the basis
of their manifest values and the consistency of the results
across tests. A comparison of alpha values across groups was
made by a procedure suggested by Feldt (1969). In all cases,
critical values for significant difference were those for a
two-tailed test.

Conventional Testing Group. There were two scoring
techniques used under this testing method: Differential
Weighted Scoring and Conventional Scoring. Alpha
coefficients obtained from test scores under these scoring
techniques are shown in Table 10.

The results in Table 10 indicate that alpha
coefficients of test scores obtained under the DWS technique were
consistently higher than those under the CVS1 technique.
However, since the CVS1 technique was used as a baseline for
all comparisons, values from both the DWS and CVS1 techniques
were retained for further comparisons across groups.


TABLE 10


ALPHA COEFFICIENTS IN CV GROUP


TABLE 11


ALPHA COEFFICIENTS IN CF GROUP


Confidence Testing Group. Five scoring functions,
labelled CWS1 to CWS5, were employed with the tests given
to the Confidence Testing group. Alpha coefficients obtained
from test scores using these scoring functions are shown in
Table 11. Table 12 shows the rank orders of these values
from the highest to the lowest within each test.

Table 12 shows that the rank orders of all five 

scoring functions were very consistent across tests. There 
was only one exception, RA), where alpha values from the CWS, 


and CWS. changed their ranks. In all cases the alpha values 


2) 


under the CWS4 were highest, those under the CWS3 ranked


second, and values from CWS. were lowest. Alpha values from 


the CWS, and CWS, functions may be considered as ranking 


ih 3 
third and fourth respectively. Since the alpha values under
the CWS4 and CWS3 were consistently high relative to all
others, these values were retained for further comparisons
across groups, i.e., there were two scoring techniques chosen
from this group for further study.

Elimination Testing Group. There were two scoring
techniques used with this testing method: Elimination Scoring
and Conventional Scoring. Alpha coefficients obtained from
test scores utilizing these techniques are shown in Table 13.
The CVS2 technique gave consistently higher values than did
the ELS technique in all tests. However, since the ELS
technique is one of the distinctive techniques used in this
study, values from both the ELS and CVS2 techniques were
retained for further comparisons across groups.


A Comparison of Alpha Coefficients Across Groups. As 




TABLE 12


RANK ORDERS OF ALPHA VALUES FROM THE SAME TEST


TABLE 13


ALPHA COEFFICIENTS IN EL GROUP


            Scoring Technique
Test        ELS        CVS2

VA1         .647       .742
VA2         ...        ...
RA1         .565       .602
RA2         ...        .670
a result of the comparisons within each group, alpha values
from six scoring techniques were retained for comparisons
across groups. These were: the DWS and CVS1 techniques in
the CV group; the CWS4 and CWS3 in the CF group; and the ELS
and CVS2 techniques in the EL group. Alpha values obtained
from these techniques were compared within each test, VA1 and
RA1, using a test of significant differences between two
alpha values suggested by Feldt (1969). The values from VA2
and RA2 tests were not considered here since test scores from
the second test session were obtained for the purpose of
test-retest reliability.

Since the procedure for testing the significance of
differences between two alpha values made use of the F
distribution and the problem was concerned with a two-tailed
test, the probabilities of the critical values of F at .005
and .025 levels were doubled to make the probabilities for
the present test .01 and .05 respectively. The degrees of
freedom for all tests were between 300 and 350. There was
no F table that gave degrees of freedom in this neighbourhood.
The critical values of F used in these tests were calculated
for each pair of the degrees of freedom. These values, with
their degrees of freedom, are shown in Table 14. When using
these critical values, the first degree of freedom is that
of the group with the larger value of alpha in the comparison.
Results of these tests for VA1 are shown in Table 15, and for
RA1 in Table 16.
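
The F-ratios entered in Tables 15 and 16 agree with the ratio (1 - smaller alpha) / (1 - larger alpha), the form of Feldt's statistic for independent samples. A short Python sketch, with a hypothetical function name, reproduces two of the entries for the VA1 test from the alpha values reported in Table 15.

```python
def feldt_ratio(alpha_a, alpha_b):
    """Ratio (1 - smaller alpha) / (1 - larger alpha), so the result is >= 1.

    The ratio is referred to the F distribution; the first degree of
    freedom belongs to the group with the larger alpha.
    """
    larger, smaller = max(alpha_a, alpha_b), min(alpha_a, alpha_b)
    return (1 - smaller) / (1 - larger)


# Alpha coefficients reported for the VA1 test (Table 15).
alpha_dws, alpha_cvs1, alpha_cws3 = 0.67084, 0.62718, 0.76535

ratio_dws_cws3 = feldt_ratio(alpha_dws, alpha_cws3)
ratio_cvs1_cws3 = feldt_ratio(alpha_cvs1, alpha_cws3)
print(round(ratio_dws_cws3, 4), round(ratio_cvs1_cws3, 4))  # 1.4028 1.5888
```

Each ratio is then compared with the critical value in Table 14 for the corresponding pair of degrees of freedom.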
TABLE 14


CRITICAL VALUES OF THE F-RATIO WITH DIFFERENT DEGREES OF
FREEDOM FOR TESTS OF DIFFERENCES BETWEEN ALPHA
COEFFICIENTS ACROSS GROUPS


Comparison     Degrees of*     F-ratio at     F-ratio at
Groups         Freedom         .05 level      .01 level

CV-CF          345,347         ...            ...
CF-CV          347,345         ...            ...
CV-EL          345,333         1.2380         ...
EL-CV          333,345         ...            1.3234
CF-EL          347,333         ...            ...
EL-CF          333,347         ...            1.3228


* d.f.1 is always from the group which has a larger value
of alpha coefficient in the comparison.
TABLE 15

F-RATIO OF COMPARISONS OF ALPHA COEFFICIENTS FROM VA1
BETWEEN SCORING TECHNIQUES ACROSS GROUPS

Scoring       DWS        CVS1       CWS3       CWS4       ELS        CVS2
Techniques  (.67084)   (.62718)   (.76535)   (.77371)   (.64712)   (.74188)

DWS            -          +       1.4028**   1.4546**   1.0721     1.2752*
CVS1                      -       1.5888**   1.6475**   1.0565     1.4444**
CWS3                                 -          +       1.5039**   1.1000
CWS4                                            -       1.5594**   1.1406
ELS                                                        -          +
CVS2                                                                  -


TABLE 16

F-RATIO OF COMPARISONS OF ALPHA COEFFICIENTS FROM RA1
BETWEEN SCORING TECHNIQUES ACROSS GROUPS

Scoring       DWS        CVS1       CWS3       CWS4       ELS        CVS2
Techniques  (.70893)   (.59590)   (.70857)   (.71493)   (.56462)   (.60152)

DWS            -          +       1.0012     1.0210     1.4958**   1.3690**
CVS1                      -       1.3866**   1.4175**   1.0774     1.0141
CWS3                                 -          +       1.4939**   1.3673**
CWS4                                            -       1.5278**   1.3978**
ELS                                                        -          +
CVS2                                                                  -


** Significant at .01 level.   * Significant at .05 level.
 + No test available.
To gain a more meaningful picture of the results
from these two tables, the information with respect to the
significant differences is presented schematically as
follows:

VA1 Test

CWS4   CWS3   CVS2   ELS    DWS    CVS1
------------------
              ----------
                     ------------------

RA1 Test

CWS4   CWS3   DWS    CVS1   CVS2   ELS
-----------------
              -----------
                     -----------------


Scoring techniques underlined by a common line do
not differ from each other; those not underlined by a common
line do differ. In this case, techniques from the same group,
e.g., the DWS and CVS1 from the CV group, are also
underlined by a common line since there was no test for the
differences between the values from these techniques.


The results of VA1 test show that, among six values
of alpha coefficient, those from the CWS4, CWS3, and CVS2 are
significantly higher than those from the ELS, DWS, and CVS1.
The results of RA1 test are similar. The alpha values from
the CWS4, CWS3, and DWS are significantly higher than those
from the CVS1, CVS2, and ELS. In both tests, the alpha
coefficients from the CWS4 and CWS3 are higher than others
including the values from the typical conventional scoring
technique, the CVS1. Thus the results from these two tests
indicate that the CWS4 and CWS3 provide the most reliable
test scores in terms of the internal consistency.



Summary of Results 

In this section, the focus of the study is on the
selection of the scoring techniques which provide test scores
with high internal consistency in terms of their alpha
coefficients. A comparison among the alpha values was made
within each group and then across groups. It was shown that
the alpha coefficients from test scores under the CWS4 and
CWS3 techniques from the CF group were higher than those values
under the other techniques including the CVS1, the typical
conventional technique. The analyses show no distinction
among the alpha values from the DWS, CVS1, ELS, and CVS2
techniques. Thus, the results indicate that the CWS4 and CWS3
experimental techniques provide more reliable test scores
in terms of the internal consistency than does the
conventional technique.


Test-Retest Reliability 

Test-retest reliability was measured using a
product-moment correlation between test scores from the first and
the second test sessions with each scoring technique. Tests
of the significance of differences between two r's were
available for both dependent and independent cases. A procedure
suggested in Olkin's article (1967) was used for r's
in the same group, and Fisher's z transformation and test was
used for r's across groups (Glass & Stanley, 1970, p. 311).
In all cases, the critical values used were for two-tailed
tests.
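
The test for r's across groups can be sketched with Fisher's z transformation. The correlations and group sizes below are hypothetical and serve only to show the computation; the critical value of 1.96 is the usual two-tailed value at the .05 level.

```python
import math


def fisher_z_test(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher's z transformation
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / standard_error


# HYPOTHETICAL test-retest r's from two independent groups of
# assumed sizes 346 and 334.
z = fisher_z_test(0.68, 346, 0.60, 334)
significant_05 = abs(z) > 1.96                   # two-tailed test at .05
print(round(z, 2), significant_05)
```

The transformation is needed because the sampling distribution of r is skewed when the population correlation is far from zero, whereas the transformed value is approximately normal.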




Conventional Testing Group. There were two values of
test-retest reliability for each test, one for test scores
under the DWS technique and one under the CVS1 technique.
Table 17 shows these values and results of a test of
significance of the difference between each pair.


TABLE 17


TEST-RETEST RELIABILITIES IN CV GROUP


              Scoring Technique
Test        DWS            CVS1        z
VA          .620           .680        1.0009
RA          [illegible]    .607        1.1646


There were no significant differences between reliabilities
under the different scoring techniques. However, the CVS1
technique provided higher values than did the DWS technique
in both tests. These results are different from those
obtained using alpha coefficients. For the purpose of a
comparison across groups, the reliabilities under the CVS1
technique were retained for further study.

Confidence Testing Group. There were five values of
test-retest reliability for each test resulting from the use
of the five scoring functions in this group. Table 18 shows
these values, and Tables 19 and 20 show the results of tests
of significance of the differences between reliabilities
using the different scoring functions.

TABLE 18


TEST-RETEST RELIABILITIES IN CF GROUP 


Scoring Technique 


TABLE 19


RESULTS OF COMPARISONS OF TEST-RETEST RELIABILITIES
OF VA TEST SCORES (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2      CWS3      CWS4      CWS5
Techniques   (.764)    (.715)    [illeg.]  [illeg.]  [illeg.]
CWS1           -       1.0236    0.6965    0.8069    0.2429
CWS2                     -       1.7407    1.9000    0.7772
CWS3                               -       0.1268    [illeg.]
CWS4                                         -       1.0615
CWS5                                                   -

TABLE 20


RESULTS OF COMPARISONS OF TEST-RETEST RELIABILITIES
OF RA TEST SCORES (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2      CWS3      CWS4      CWS5
Techniques   (.669)    (.648)    (.684)    (.688)    (.667)
CWS1           -       0.3475    [illeg.]  [illeg.]  0.0337
CWS2                     -       0.6145    0.6914    0.3134
CWS3                               -       0.0705    0.2942
CWS4                                         -       0.3678
CWS5                                                   -


There were no significant differences among the
reliabilities obtained using the different scoring functions
for either the VA or the RA test. However, since the purpose
of this comparison was to choose the function that provided
the highest reliability, a rank ordering of these values was
used to make the selection. Table 21 shows the results of
ranking the reliabilities from the same test.
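The rank ordering described above can be sketched in a few lines. The values are the RA test-retest reliabilities shown in parentheses in the heading of Table 20; the function name is hypothetical, and ties are not given any special treatment in this sketch.

```python
def rank_descending(values):
    """Return ranks (1 = highest) in the original order of the values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

techniques = ["CWS1", "CWS2", "CWS3", "CWS4", "CWS5"]
ra_reliabilities = [0.669, 0.648, 0.684, 0.688, 0.667]

ra_ranks = dict(zip(techniques, rank_descending(ra_reliabilities)))
print(ra_ranks)
```

Applying the same function to the VA reliabilities and comparing the two rank lists is one way to check the agreement between tests.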


TABLE 21 


RANK ORDERS OF RELIABILITIES WITHIN EACH TEST 


Scoring Technique

The results in Table 21 show perfect agreement between
tests. The reliabilities of the CWS4 ranked first and so on
to those of the CWS2, which ranked last. This result was
consistent with that for the alpha coefficients. The
reliabilities of test scores under the CWS3 and CWS4
techniques were thus retained for further comparisons.

Elimination Testing Group. There were two values of
test-retest reliability for each test, one for test scores
under the ELS technique and one under the CVS2 technique.
Table 22 shows these values and results of a test of
significance of the difference between each pair.


TABLE 22 


TEST-RETEST RELIABILITIES IN EL GROUP 


Scoring Technique 


0.1494 


0.1469 


There were no significant differences between reliabilities
under the different scoring techniques. This result was the
same as in the two previous groups. There was also a
consistent indication that one technique, the ELS, provided
higher reliabilities than did the other, the CVS2.
Therefore, the reliabilities under the ELS technique were
retained for further comparisons. This result was not the
same as in the case of the alpha coefficients.

A Comparison of Test-Retest Reliabilities across Groups. As
a result of comparisons within each group, reliabilities
from test scores under four scoring techniques were retained
for this comparison. These were: the CVS1 technique in the
CV group, the CWS3 and CWS4 techniques in the CF group, and
the ELS technique in the EL group. Reliabilities of test
scores under these scoring techniques were compared within
each test. A test of significance of the differences between
each pair of values was made using Fisher's z transformation
and test (see Glass & Stanley, 1970, p. 311). Table 23 shows
the results from the VA test and Table 24 from the RA test.


TABLE 23


RESULTS OF COMPARISONS OF TEST-RETEST RELIABILITIES OF VA
TEST SCORES ACROSS GROUPS (TABLED VALUES ARE z's)


Scoring
Techniques    CVS1     CWS3        CWS4        ELS
(reliabilities in the column heading are illegible in source)

CVS1            -      3.2594**    3.4230**    0.2206
CWS3                     -         0.1268+     3.4013**
CWS4                                 -         [illegible]**
ELS                                              -

** Significant at .01 level.

 + z value obtained from the comparison within group.
TABLE 24


RESULTS OF COMPARISONS OF TEST-RETEST RELIABILITIES OF RA
TEST SCORES ACROSS GROUPS (TABLED VALUES ARE z's)


[Table entries are largely illegible in the source; the only
legible heading fragment is CWS3 (.684).]

 * Significant at .05 level.

 + z value obtained from the comparison within group.


Results of the test of differences in the two tables were
not quite the same. Reliabilities from the VA test scores
differed from each other more than did reliabilities from RA
test scores. To gain a more meaningful picture of these
results, the following summary of information was made:


VA Test

    CWS4    CWS3    CVS1    ELS

RA Test

    CWS4    CWS3    CVS1    ELS


In the above schemes, scoring techniques underlined by a
common line do not differ from each other, and those not
underlined by a common line do differ. It is evident that
the reliabilities of test scores under the CWS3 and CWS4
techniques were, in general, higher than those under the
CVS1 and ELS techniques. The reliabilities under the ELS
technique were likely to be the lowest. The result that the
CWS3 and CWS4 techniques provide more reliable test scores
than do the other techniques was the same as in the case of
the alpha coefficients.


Summary of Results 

The purpose of the study in this section was to select the
scoring technique that provides test scores with the highest
test-retest reliability. A comparison among the
reliabilities was made within each group and then across
groups. The results were consistent with the alpha
coefficients. The results show that the CWS3 and CWS4
techniques provided more reliable test scores than did the
other techniques used in this study. Since the CVS1
technique was the typical conventional technique, the
results also indicate that the CWS3 and CWS4 are better
techniques in terms of test-retest reliability than the
conventional technique.


Validity 


There were two types of criteria used in the study of the
validity of test scores. They were: School Achievement and
Aptitude Test Scores. It was suspected that test scores
under the experimental techniques would correlate
differently with different types of criteria. For this
reason, the results of the analysis were grouped according
to types of criterion scores: the validity with School
Achievement, and with Aptitude Test Scores. Tests of the
differences between validities within the same group were
done by a test of the differences between two correlations
with dependent samples (see Glass & Stanley, 1970, and Olkin
& Siotani, 1964). Tests of the differences between
validities across groups were done by using Fisher's z
transformation and test (Glass & Stanley, 1970, p. 311). It
is noted that the significance of the difference between two
validities within the same group depends not only on the two
values but also on the correlation between the test scores
from which the validities are obtained. Thus, the same
difference does not necessarily mean the same level of
significance. In all cases, the critical values used were
for two-tailed tests.
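The remark that the same difference between two validities need not reach the same level of significance can be illustrated with a large-sample z test for two dependent correlations. The formula below is one form commonly attributed to Olkin; that it matches the exact procedure used in this study is an assumption, and the function name and all numerical inputs are hypothetical.

```python
import math

def dependent_r_z(r_xy, r_xz, r_yz, n):
    """Large-sample z for H0: rho_xy = rho_xz, all measured on the same n.

    x is the criterion; y and z are two scorings of the same test.
    """
    numerator = math.sqrt(n) * (r_xy - r_xz)
    variance = ((1 - r_xy ** 2) ** 2
                + (1 - r_xz ** 2) ** 2
                - 2 * r_yz ** 3
                - (2 * r_yz - r_xy * r_xz)
                * (1 - r_xy ** 2 - r_xz ** 2 - r_yz ** 2))
    return numerator / math.sqrt(variance)

# Same pair of validities (.45 vs .42), same n, but a different
# correlation between the two sets of test scores.
z_low = dependent_r_z(0.45, 0.42, 0.90, 348)
z_high = dependent_r_z(0.45, 0.42, 0.98, 348)
print(round(z_low, 2), round(z_high, 2))
```

With these made-up inputs the difference of .03 falls short of the two-tailed .05 critical value when the two scorings correlate .90, but exceeds the .01 critical value when they correlate .98.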


School Achievement Scores as Criteria 

School Achievement scores used in this study were obtained
from three subjects: Language Arts, Mathematics, and
Science. An academic average was also obtained and used for
test validity calculations. There was no evidence to show
how reliable these scores were, since they were collected
and recorded by schools in terms of total scores. However, a
consideration of the intercorrelations among these scores
within each group suggests that they were reasonable
criteria for this study. Tables 25, 26, and 27 show
intercorrelations among the four scores within each group.

TABLE 25


INTERCORRELATIONS OF SCHOOL ACHIEVEMENT
SCORES IN CV GROUP


Subjects


TABLE 26


INTERCORRELATIONS OF SCHOOL ACHIEVEMENT
SCORES IN CF GROUP


Subjects    LA      MA          SC          AV
LA           -      [illeg.]    .608        [illeg.]
MA                   -          [illeg.]    [illeg.]
SC                               -          [illeg.]
AV                                           -

TABLE 27


INTERCORRELATIONS OF SCHOOL ACHIEVEMENT
SCORES IN EL GROUP


Subjects

Results from these tables indicate the same pattern of
intercorrelations among the three groups. Tests of the
differences among the correlations using the same criterion
across groups were also made using Fisher's z transformation
and test (Glass & Stanley, 1970, p. 311). No statistically
significant differences were obtained. The evidence suggests
the comparability of these three groups in terms of school
achievement and indicates that these achievement scores
could be used as criteria for any of the groups.

The following is the study of the validities of VA and RA
test scores under each test-taking method when school
achievement scores were used as criteria. Tests of the
differences between validities within the same group were
done by a test of the difference between two correlations
with dependent samples (see Glass & Stanley, 1970, p. 313,
and Olkin & Siotani, 1964).

Conventional Testing Group. Table 28 shows the validities of
test scores under two scoring techniques, the DWS and the
CVS1. Table 29 shows the results of tests of the differences
between validities of the same test under different scoring
techniques.

Only two values from VA test scores attained statistical
significance, but all were significant in the case of RA
test scores. Since all significant differences were in
favour of the validities under the CVS1 technique, it was
evident that the validities of test scores under the CVS1
technique were, in general, higher than those under the DWS
technique. Therefore, the validities under the CVS1
technique were retained for further comparisons across
groups.

TABLE 28


VALIDITIES OF TEST SCORES WITH SCHOOL 
ACHIEVEMENT IN CV GROUP 


Scoring Criteria 


Technique 


TABLE 29


DIFFERENCES BETWEEN VALIDITIES UNDER DWS AND CVS1
SCORING TECHNIQUES (TABLED VALUES ARE z's)


Test              Differences Between Values
VA1    (row largely illegible; one legible value: 0.4829)
VA2    2.0420*     1.3838         2.0802*     1.7262
RA1    4.5193**    [illegible]    6.8931**    5.93**
RA2    4.6771**    6.9846**       6.5592**    6.6808**

** Significant at .01 level.

Confidence Testing Group. Table 30 shows the validities of
test scores under five scoring functions, the CWS1 to CWS5.
Tables 31 to 38 show the results of tests of the differences
between validities of test scores from VA, and Tables 39 to
46 from RA, one table for each criterion measure.

Since there were many pairs of values and tables involved,
the results were considered for each test first, followed by
a general evaluation.
Validities of VA Test Scores


1. The results of the tests of differences in Table 31, when
LA was a criterion, were confounded. No summary could be
made. But the results in Table 32, for the VA2 test scores,
were clear. The highest validities were obtained for CWS2,
followed by CWS1 and CWS5, then CWS3, and finally CWS4. In
both tables there were indications that the validities under
the CWS3 were significantly higher than those under the
CWS4.

2. There was only one pair of comparisons that attained
statistical significance when MA was a criterion. This was
the difference between the validities under the CWS3 and
CWS4. It is apparent that the validities under all scoring
functions in Tables 33 and 34 are generally comparable.

3. The results in Tables 35 and 36 were the same as those in
Tables 33 and 34, except that one more pair attained

TABLE 30


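VALIDITIES OF TEST SCORES WITH SCHOOL
ACHIEVEMENT IN CF GROUP

(Only the AV criterion column is legible in the source.)

        Scoring
Test    Technique     AV
VA1     CWS1          .410
        CWS2          .408
        CWS3          .406
        CWS4          .393
        CWS5          .407
VA2     CWS1          .486
        CWS2          .497
        CWS3          .462
        CWS4          .450
        CWS5          .479
RA1     CWS1          .466
        CWS2          .458
        CWS3          .469
        CWS4          .467
        CWS5          .464
RA2     CWS1          .441
        CWS2          .435
        CWS3          .442
        CWS4          .439
        CWS5          .442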
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TABLE 31 


DIFFERENCES BETWEEN VALIDITIES OF VA1 TEST SCORES
WITH LA (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2      CWS3      CWS4       CWS5
Techniques   (.336)    (.342)    (.323)    (.322)     (.334)
CWS1           -       0.3385    1.4822    1.9983*    0.4427
CWS2                     -       1.0776    1.4476     0.9477
CWS3                               -       2.7846**   0.9531
CWS4                                         -        0.4794
CWS5                                                    -

TABLE 32


DIFFERENCES BETWEEN VALIDITIES OF VA2 TEST SCORES
WITH LA (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2      CWS3       CWS4       CWS5
Techniques   (.426)    (.450)    (.388)     (.374)     (.419)
CWS1           -       2.3669*   3.8454**   4.1451**   1.0735
CWS2                     -       3.4663**   3.6576**   3.4531**
CWS3                               -        3.6175**   2.6623**
CWS4                                          -        3.0380**
CWS5                                                     -

** Significant at .01 level.   * Significant at .05 level.

TABLE 33


DIFFERENCES BETWEEN VALIDITIES OF VA1 TEST SCORES
WITH MA (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2      CWS3      CWS4       CWS5
Techniques   (.333)    (.332)    (.332)    (.320)     (.332)
CWS1           -       0.0563    0.1143    1.0846     0.2212
CWS2                     -       0         0.5791     0
CWS3                               -       3.0444**   0
CWS4                                         -        0.8083
CWS5                                                    -

TABLE 34


DIFFERENCES BETWEEN VALIDITIES OF VA2 TEST SCORES
WITH MA (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2      CWS3      CWS4      CWS5
Techniques   (.392)    (.390)    (.383)    (.377)    (.385)
CWS1           -       0.1934    0.9110    1.1989    1.0560
CWS2                     -       0.3879    0.6212    0.5490
CWS3                               -       1.5600    0.1710
CWS4                                         -       0.5390
CWS5                                                   -

** Significant at .01 level.

TABLE 35


DIFFERENCES BETWEEN VALIDITIES OF VA1 TEST SCORES
WITH SC (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2      CWS3      CWS4       CWS5
Techniques   (.400)    (.396)    (.397)    (.386)     (.392)
CWS1           -       0.2317    0.3225    1.2007     1.8134
CWS2                     -       0.0583    0.4963     0.4854
CWS3                               -       2.8639**   0.4456
CWS4                                         -        0.4151
CWS5                                                    -

TABLE 36


DIFFERENCES BETWEEN VALIDITIES OF VA2 TEST SCORES
WITH SC (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2      CWS3      CWS4      CWS5
Techniques   (.468)    (.473)    (.453)    (.445)    [illeg.]
CWS1           -       0.5046    1.5749    1.9045    [illeg.]
CWS2                     -       1.1540    1.3922    [illeg.]
CWS3                               -       2.1480*   [illeg.]
CWS4                                         -       [illeg.]
CWS5                                                   -

** Significant at .01 level.   * Significant at .05 level.

TABLE 37


DIFFERENCES BETWEEN VALIDITIES OF VA1 TEST SCORES
WITH AV (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2      CWS3      CWS4       CWS5
Techniques   (.410)    (.408)    (.406)    (.393)     (.407)
CWS1           -       0.1165    0.4122    1.4622     0.6854
CWS2                     -       0.1173    0.7479     0.1222
CWS3                               -       3.3888**   0.0897
CWS4                                         -        0.9731
CWS5                                                    -

TABLE 38


DIFFERENCES BETWEEN VALIDITIES OF VA2 TEST SCORES
WITH AV (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2       CWS3       CWS4       CWS5
Techniques   [illeg.]  [illeg.]   [illeg.]   (.450)     (.479)
CWS1           -       [illeg.]   2.9309**   2.9870**   [illeg.]
CWS2                     -        2.0356*    [illeg.]*  2.0802*
CWS3                                -        [illeg.]** [illeg.]
CWS4                                           -        2.0356*
CWS5                                                      -

** Significant at .01 level.   * Significant at .05 level.

TABLE 39


DIFFERENCES BETWEEN VALIDITIES OF RA1 TEST SCORES
WITH LA (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2      CWS3      CWS4      CWS5
Techniques   (.304)    (.304)    (.300)    (.297)    (.303)
CWS1           -       0         0.6188    0.7655    0.2528
CWS2                     -       0.3252    0.4672    0.4378
CWS3                               -       0.9269    0.3390
CWS4                                         -       0.5052
CWS5                                                   -

TABLE 40


DIFFERENCES BETWEEN VALIDITIES OF RA2 TEST SCORES
WITH LA (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2      CWS3      CWS4      CWS5
Techniques   [illeg.]  (.272)    (.272)    (.271)    [illeg.]
CWS1           -       0.3066    0.3298    0.8477    [illeg.]
CWS2                     -       0         0.6677    [illeg.]
CWS3                               -       0.4334    0.3360
CWS4                                         -       0.3470
CWS5                                                   -


TABLE 41


DIFFERENCES BETWEEN VALIDITIES OF RA1 TEST SCORES
WITH MA (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2      CWS3      CWS4      CWS5
Techniques   [illeg.]  (.489)    (.502)    (.499)    (.499)
CWS1           -       1.6923    0.5223    0         0
CWS2                     -       1.1605    0.7338    [illeg.]
CWS3                               -       1.0911    0.3736
CWS4                                         -       0
CWS5                                                   -

TABLE 42


DIFFERENCES BETWEEN VALIDITIES OF RA2 TEST SCORES
WITH MA (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2      CWS3      CWS4      CWS5
Techniques   (.479)    (.469)    (.483)    (.479)    (.484)
CWS1           -       1.6714    0.7193    0         1.3717
CWS2                     -       1.2340    0.7417    2.8833**
CWS3                               -       1.8965    0.1232
CWS4                                         -       0.4767
CWS5                                                   -

** Significant at the .01 level.

TABLE 43


DIFFERENCES BETWEEN VALIDITIES OF RA1 TEST SCORES
WITH SC (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2      CWS3      CWS4      CWS5
Techniques   (.426)    (.417)    (.433)    (.433)    (.421)
CWS1           -       1.4620    [illeg.]  0.8078    [illeg.]
CWS2                     -       [illeg.]  [illeg.]  [illeg.]
CWS3                               -       0         [illeg.]
CWS4                                         -       [illeg.]
CWS5                                                   -

TABLE 44


DIFFERENCES BETWEEN VALIDITIES OF RA2 TEST SCORES
WITH SC (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2      CWS3      CWS4      CWS5
Techniques   (.430)    (.422)    (.435)    (.434)    (.429)
CWS1           -       1.4650    0.6996    [illeg.]  0.5335
CWS2                     -       [illeg.]  0.8665    1.3159
CWS3                               -       0.4631    0.7169
CWS4                                         -       0.4630
CWS5                                                   -
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DIFFERENCES BETWEEN VALIDITIES OF RA, TEST SCORES 
WITH AV (TABLED VALUES ARE z's) 


Scoring CWS, CWS. CWS 3 
Techniques (. 466) (.458) (.469) 
CWS, - 13289 0.5007 
CWS, - 0.9627 
cws, o 
CWS, 
CWS, 
TABLE 46 


CWS 4 


(.467) 
0.1180 


0.6474 


0.6673 


CWS. 


(.464) 
0.5440 

x 
PRES HOA OS: 
0.6093 


Om 26 


DIFFERENCES BETWEEN VALIDITIES OF RA g TEST SCORES 
WITH AV (TABLED VALUES ARE z's) 


efotar eats, CWS, CWS. CWS 3 
Techniques (.441) (.435) (.442) 
CWS, ~ 0.9840 O17 58 
cws, - 0.6042 
cws , _ 
CWS, 
CWS, 


Significant at .01 level.

statistical significance. With SC as a criterion, the
differences between the validities under the CWS3 and CWS4
were clearly in favour of the CWS3. All other values were
comparable.

4. The results in Table 37, when AV was used as a criterion,
showed only one pair attaining statistical significance.
This was the difference between validities under the CWS3
and CWS4. The results in Table 38 were clearer than those in
Table 37. The validity under the CWS2 was significantly
higher than all except the CWS1 function. The validity under
the CWS4 seemed to be the lowest. The difference between
validities under the CWS3 and CWS4 also attained statistical
significance.

The results of tests of the differences between validities
of test scores from VA were generally inconsistent, except
for the difference between those values under the CWS3 and
CWS4, which favoured the CWS3. No scoring function provided
consistently high validities when school achievement scores
were used as the criteria.
Validities of RA Test Scores


The results of tests of the differences between validities
of RA test scores are readily examined. Among all pairs of
comparisons, there were only three which attained
statistical significance. These were the differences between
validities under the CWS2 and CWS5 in Tables 41, 42 and 45.

It is apparent that no scoring function provided higher
validity than did the others. The only exception was a
tendency for the validities under the CWS5 to be higher than
those under the CWS2, especially in the case where MA was a
criterion.

Since no scoring technique showed superiority over the
others in terms of validity when school achievement was the
criterion, the validities under all of these scoring
techniques were retained for comparisons with those
validities from the other groups.

Elimination Testing Group. Table 47 shows the validities of
test scores under two scoring techniques, the ELS and CVS2,
and Table 48 shows the results of tests of the differences
between validities of test scores under these scoring
techniques.

The results of the comparisons between validities of test
scores for VA are apparent. All pairs attained statistical
significance, consistently showing the superiority of the
ELS technique over the CVS2 technique. The results in the
case of RA test scores were, however, less clear but still
consistent in the case of RA2. All significant differences
in this case were also in favour of the ELS technique. It is
likely that these two scoring techniques affected VA test
scores more than RA test scores. For the purpose of a
comparison across groups, the validities of test scores
under the ELS technique were retained for further study.

TABLE 47


VALIDITIES OF TEST SCORES WITH SCHOOL 
ACHIEVEMENT IN EL GROUP 


Criteria 


SCOring 
Technique 


TABLE 48 


DIFFERENCES BETWEEN VALIDITIES UNDER ELS AND CVS2
SCORING TECHNIQUES (TABLED VALUES ARE z's)


Test              Differences Between Values
VA1    4.0368**    [illegible]    [illegible]    [illegible]
VA2    4.4973**    4.8701**       5.2495**       3.9195**
RA1    [illegible] 1.4290         [illegible]    1.7433
RA2    (row illegible; all four entries carry significance
       marks)

** Significant at .01 level.   * Significant at .05 level.

VB and RB Test Scores as Criteria

VB and RB tests were given to all students at the 
second test session. These tests were administered and 
scored under the conventional method. Therefore, the alpha 
reliabilities were computed from the total group. These 
were .778 for VB and .563 for RB. Table 49 shows the 
correlations of test scores from these two tests with School 


Achievement within each group. 


TABLE 49 


CORRELATIONS OF VB AND RB TEST SCORES WITH SCHOOL ACHIEVEMENT 


School Achievement 


Test 

z x 
CV-VB 34:70 ah es: ~442 age Ue) 
CF-VB 394 sooe -447 -476 
EL-VB ag ek eet: ee A 7455 
CV-RB e235 eh: - 348 - 368 
CF-RB males p- Pe HLS age il - 306 
EL-RB Reais a3 LG 2 5242 


Tests of the differences between the correlations of the
same test with each criterion were made using Fisher's z
transformation and test (Glass & Stanley, 1970, p. 311). No
difference attained statistical significance. This suggests
the comparability of VB and RB test scores as criteria among
the three groups.

Details are given below of the study of the validities of VA
and RA test scores under each test-taking method when VB and
RB test scores were used as criteria. Tests of the
differences between validities within the same group were
done by a test of the difference between two correlations
with dependent samples (see Glass & Stanley, 1970, p. 313,
and Olkin & Siotani, 1964).

Conventional Testing Group. Table 50 shows the validities of
test scores under two scoring techniques, the DWS and CVS1,
and Table 51 shows the results of tests of the differences
between validities of the same test under different scoring
techniques.

The results of tests of the differences between the
validities of test scores from RA are clear. All validities
under the CVS1 were significantly higher than those under
the DWS. However, the results from the VA test are not as
definite. There was only one difference that attained
statistical significance in favour of the CVS1. For the
purpose of a comparison across groups, the validities of
test scores from both tests under the CVS1 technique were
retained. This selection of the CVS1 technique was the same
as that when using School Achievement as the criterion.

Confidence Testing Group. Table 52 shows the validities of
test scores under five scoring functions, the CWS1 to CWS5.
Tables 53 to 56 show the results of tests of the differences
between the validities of test scores from VA, and Tables 57
to 60 from RA, one table for each criterion.

TABLE 50


VALIDITIES OF TEST SCORES WITH VB AND RB 
TEST SCORES IN CV GROUP 


Criteria 


Scoring 


ace Technique RB 
VA, a242 
3302 

.376 
RA, 73.90 
-496 
RA, -481 
ToG2 


TABLE 51 


DIFFERENCES BETWEEN VALIDITIES UNDER DWS AND CVS1
SCORING TECHNIQUES (TABLED VALUES ARE z's)


                Criteria
Test       VB            RB
VA1        0.4359        [illegible]
VA2        2.2030*       0.9497
RA1        4.4328**      3.5618**
RA2        4.4145**      [illegible]**

** Significant at .01 level.   * Significant at .05 level.

TABLE 52


VALIDITIES OF TEST SCORES WITH VB AND RB 
TEST SCORES IN CF GROUP 


a 


Scoring Criteria 


g25° Technique 


RB 


O29 
ests) 
.307 
e295 


20 


VA HEE) 
oe 
314 
ese 


sel ye 


- 488 
2475 
493 
-494 
483 
osu 
ouwZ 
spaeh i) 
ToGo 


~o22 
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TABLE 53 


DIFFERENCES BETWEEN VALIDITIES OF VA1 TEST SCORES
WITH VB (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2      CWS3      CWS4      CWS5
Techniques   (.646)    (.638)    (.636)    (.627)    (.648)
CWS1           -       0.5586    1.4021    1.9389    0.5472
CWS2                     -       0.1400    0.6552    1.4538
CWS3                               -       0.9346    1.2826
CWS4                                         -       1.7380
CWS5                                                   -

TABLE 54


DIFFERENCES BETWEEN VALIDITIES OF VA2 TEST SCORES
WITH VB (TABLED VALUES ARE z's)


Scoring       CWS1      CWS2      CWS3        CWS4        CWS5
Techniques   (.673)    (.677)    (.647)      (.630)      [illeg.]
CWS1           -       [illeg.]  [illeg.]**  4.0793**    0.1882
CWS2                     -       1.7084      2.4497*     0
CWS3                               -         [illeg.]**  2.6033**
CWS4                                           -         [illeg.]**
CWS5                                                       -

** Significant at .01 level.   * Significant at .05 level.

TABLE 55
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DIFFERENCES BETWEEN VALIDITIES OF VA, TEST SCORES 


WITH RB (TABLED VALUES ARE z's) 


Scoring CWS, CWS. CWS, CWS, 
Pearinsques | (,4329) (8338) (.307) (.295) 
* k* 
cw, a 0.5067 2.4940 2.8151 
* 
cws, is 1.7505 2.0657 
* 
CWS, = 1.9814 
CWS, 2 
CWS, 
TABLE 56 


5 
(.828) 
0.2208 
).2ea2h 
©. 8107 

i he 
2.2082 


DIFFERENCES BETWEEN VALIDITIES OF VAg TEST SCORES 


WITH RB (TABLED VALUES ARE z's) 


aoe iiig CWS, cws,, CWS, CWS, 
Techniques | (335) (.318) (.314) 386) 
* 
CWS, - 1.5989 2.0634 1.7144 
CWS, = 0.2153 0.2323 
CWS, : 0.2536 
CWS, - 
CWS, 


Sioniticant at .02 weve. 


5 
wae) 
kk 

3.3641 
0.6398 
0.2661 


0,.0655 


Significant at .05 Level. 
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TABLE 57 
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DIFFERENCES BETWEEN VALIDITIES OF RA, TEST SCORES 


Scoring 
Techniques 


(.426) 


0.4895 


022575 


TABLE 58 


WITH VB (TABLED VALUES ARE 2's) 


(.422) 


0.8070 
074931 


193032 


DIFFERENCES BETWEEN VALIDITIES OF RAy TEST SCORES 


Scoring 
Techniques 


(.3390} 


0.5141 


0 


WITH VB (TABLED VALUES ARE 2's) 


= 
(.394) 
0.2620 
0.7404 
0.4685 


0.7254 


Ttep.0 
£foe.f 


TABLE 59 
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DIFFERENCES BETWEEN VALIDITIES OF RA, TEST SCORES 
WITH RB (TABLED VALUES ARE z's) 


CWS, 


(.488) 


Scoring 
Techniques 


CWS. CWS, 
C475) (5493) 
x 
gel / 86 0.8461 
_ dao 730 
TABLE 60 


(.494) 


0.7188 
1.3838 


OWS be ME 


5 
(.483) 
e/a 
1.5490 
ae soD 


a1 32 


DIFFERENCES BETWEEN VALIDITIES OF RA g TEST SCORES 
WITH RB (TABLED VALUES ARE z's) 


CWS, 


(2 5 0) 


Scoring 
Techniques 


Pars 
Sion Gaiu ac. ULL 


level. 


(.537) 


Lie O31 


* 
Par Ast he. 


EERE, 


1.1868 
x 
22007 


0.9886 


5 
(322) 
* 
Le 2023 
x 
Ve9816 
ley He REL 


1.6718 


Significant at .05 level.

The following is a summary of the results from Tables 53-60:

1. The results of the tests on the validities of VA1 test
scores with VB, Table 53, show that all values are
comparable. The results in the case of VA2 test scores,
Table 54, give an indication that the validities under the
CWS1 and CWS5 are a little better than the others. However,
there was no evidence to suggest that any one was the best
among all.

2. The results of the tests on the validities of VA1 and VA2
test scores with RB, Tables 55-56, were unclear, and
confounded in the case of VA2 test scores, Table 56. There
was no evidence for a single best technique.

3. No difference in Tables 57 and 58 attained statistical
significance. All values were comparable.

4. There was an indication, in Tables 59-60, that the
validities under the CWS1 were somewhat superior to the
others in the case of the validities of RA test scores with
RB, and that results under the CWS2 were likely to be the
lowest. However, no selection for the best one could be
made.

Since there was no evidence from these results to support
the selection of one scoring technique over the others, the
validities obtained under all scoring techniques were
retained for a comparison across groups in the next part of
this section.


Elimination Testing Group. Table 61 shows the

TABLE 61


VALIDITIES OF TEST SCORES WITH VB AND RB 
TEST SCORES IN EL GROUP 


Criteria 


Scoring 


ths Technique RB 
VA, PH sys) 

250 

VA, ahi 

.224 

RA, Sei ae 

ehh 

RA. ~360 

s309 


TABLE 62 


DIFFERENCES BETWEEN VALIDITIES UNDER ELS AND CVS2
SCORING TECHNIQUES (TABLED VALUES ARE z's)


                Criteria
Test       VB            RB
VA1        4.3263**      [illegible]
VA2        3.2626**      1.5442
RA1        1.2884        [illegible]
RA2        1.6487        0.4949

validities of test scores under two scoring techniques, the
ELS and CVS2, and Table 62 the results of tests of the
difference between values under these different techniques.

There were only two pairs of differences that attained
statistical significance. Both were from the validities of
VA test scores with VB and were in favour of the ELS
technique. All other values were comparable. It is apparent
that the scoring techniques affected the validities of VA
test scores more than those of RA. However, for the purpose
of a comparison across groups, the validities of test scores
under the ELS technique were retained for further study.
This selection was the same as in the case of using School
Achievement scores as criteria.

A Comparison of Validities of Test Scores Across Groups. As
a result of the comparisons within each group with one type
of criterion at a time, the same scoring techniques were
retained from the CV and EL groups. These were: the CVS1 and
ELS. The comparisons in the CF group also yielded the same
results for both types of criteria, and all scoring
techniques provided comparable values of validity.
Therefore, the validities from all scoring techniques used
in this group were retained for a comparison across groups.
Thus, there were seven values of validity for each test with
the same criterion, and six criterion scores involved in
this part of the study. Table 63 shows the validities of VA1
and RA1 test scores with all criteria. VA2 and RA2 test
scores were not considered in this part

TABLE 63


VALIDITIES OF VA] and RA] TEST SCORES UNDER 
SEVEN SCORING TECHNIQUES 


Scoring Achievement Criteria Aptitude Criteria 
el ee, Aes wt 


VA, ASE, a3 Oe 
-646 BERS. 
-638 7238 
- 636 e307 
O27 w220 
648 Ra} 
a2 ae Di) 
RA, aeted, - 496 
~429 - 488 
~429 ~475 
- 426 493 
ay 4" -494 
~427 483 
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Since test scores from the ‘second test session were used 
mainly for test-retest reliability. 

Because the tests of differences between validities 
used in this part made use of the critical values from the 
unit normal distribution and the sizes of these groups were 
very comparable (346 for the CV, 348 for the CF, and 334 for 
the EL groups), it was possible to compare these values to 
see the sequences of their ranks with each criterion. This 
arrangement was used to give an indication of whether the 
comparisons among them should be made separately with each 
type of criterion or should be combined. Table 64 was thus 
constructed using rank orders of all values with the same 
criterion. The rank orders of the validities from the CF
group were the same because they were assumed comparable as
a result of the study within this group.

The results in Table 64 show a consistent pattern 
of rank orders across criteria within each type. When School 
Achievement scores were criteria, the consistent pattern held 
across criteria, but when Aptitude Test Scores (VB and RB) 
were criteria, the consistent pattern held across test 
scores. This evidence suggests that different types of 
criteria affected test validities and that an examination 
should be made within each type to gain a better view of the 
results. Tables 65 and 66 show the results of the examina- 
tion when school achievement scores served as criteria, and 
Tables 67 and 68 when VB and RB test scores were criteria. 
TABLE 64


RANK ORDERS OF THE VALIDITIES WITH THE SAME CRITERION 


Achievement Criteria 


LA MA SC AV 


Aptitude Criteria 


TABLE 65


COMPARISONS OF THE VALIDITIES OF VA_1 TEST SCORES WHEN SCHOOL
ACHIEVEMENT SCORES WERE CRITERIA (TABLED VALUES ARE z's)


                                  Criteria
Scoring
Techniques           LA           MA           SC           AV

CVS_1 - CWS_1      2.3869    [illegible]     1.1541    [illegible]
CVS_1 - CWS_2      2.2820       1.7443       1.2197       1.8230
CVS_1 - CWS_3      2.5705       1.7443       1.2066       1.8492
CVS_1 - CWS_4      2.7278       1.9148    [illegible]     2.0459
CVS_1 - CWS_5      2.4131       1.7443       1.2853       1.8361
CVS_1 - ELS        1.4666       1.4666       0.6489    [illegible]
CWS_1 - ELS       -0.8968      -0.2469      -0.4939      -0.1300
CWS_2 - ELS       -0.7928      -0.2599      -0.5589      -0.1690
CWS_3 - ELS       -1.0788      -0.2509      -0.5489      -0.1950
CWS_4 - ELS       -1.2347      -0.4289      -0.7148      -0.3899
CWS_5 - ELS       -0.9228      -0.2599      -0.6262      -0.1820


Significant at .01 level. Significant at .05 level. 


TABLE 66


COMPARISONS OF THE VALIDITIES OF RA_1 TEST SCORES WHEN SCHOOL
ACHIEVEMENT SCORES WERE CRITERIA (TABLED VALUES ARE z's)


                                  Criteria
Scoring
Techniques           LA           MA           SC           AV

CVS_1 - CWS_1     -0.4721      -0.4984      -0.4852    [illegible]
CVS_1 - CWS_2     -0.4721      -0.3279      -0.3410       0.5902
CVS_1 - CWS_3     -0.4197      -0.5377      -0.6033      -0.7738
CVS_1 - CWS_4   [illegible]    -0.4984      -0.6033      -0.7344
CVS_1 - CWS_5     -0.4590      -0.4984      -0.4066      -0.6951
CVS_1 - ELS     [illegible]     1.5185    [illegible]    -0.6360
CWS_1 - ELS       -0.7408      -1.0268       0.3249       0.0780
CWS_2 - ELS       -0.7408    [illegible]     0.1820      -0.0519
CWS_3 - ELS       -0.7928      -0.9878       0.4419       0.1300
CWS_4 - ELS       -0.8448      -1.0268       0.4419       0.0910
CWS_5 - ELS       -0.7538      -1.0268       0.2469       0.0520
TABLE 67


COMPARISONS OF THE VALIDITIES OF VA_1 TEST SCORES WHEN
VB AND RB WERE CRITERIA (TABLED VALUES ARE z's)



                         Criteria
Scoring
Techniques            VB            RB

CVS_1 - CWS_1     [illegible]     -0.3934
CVS_1 - CWS_2       -1.0492       -0.5246
CVS_1 - CWS_3       -0.9967       -0.0656
CVS_1 - CWS_4     [illegible]      0.1049
TABLE 68


COMPARISONS OF THE VALIDITIES OF RA_1 TEST SCORES WHEN VB
AND RB WERE CRITERIA (TABLED VALUES ARE z's)


Scoring 
groups were done by using Fisher's z transformation
and test (see Glass & Stanley, 1970, p. 311).
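
The test named above can be sketched in Python as follows. This is
only an illustrative sketch under stated assumptions: the function
name and the two validity coefficients are hypothetical, and only
the group sizes (346 and 348) are taken from the text. The resulting
value is referred to the unit normal distribution, so a magnitude
above 1.96 corresponds to the .05 level.

```python
import math


def fisher_z_difference(r1: float, n1: int, r2: float, n2: int) -> float:
    """z statistic for the difference between two correlations
    obtained from independent groups (Fisher's z transformation)."""
    z1 = math.atanh(r1)  # Fisher's z transformation of the first r
    z2 = math.atanh(r2)  # Fisher's z transformation of the second r
    # Standard error of the difference between two independent z values
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / standard_error


# Hypothetical validities for two groups of the sizes used in this study
z_value = fisher_z_difference(0.65, 346, 0.55, 348)
significant_at_05 = abs(z_value) > 1.96  # two-tailed, unit normal
print(round(z_value, 4), significant_at_05)
```

A positive value favours the first technique and a negative value
the second, which matches the sign convention of the tabled z's
(for example, "CVS_1 - ELS").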

Validities With School Achievement. Table 65
shows the results of tests of the differences between the
validities of VA_1 test scores, Table 66 of RA_1 test scores.

1. The results shown in Table 64 indicate that the
validities of VA_1 test scores under the CVS_1 technique
consistently ranked first, those under the ELS technique,
second, and under the CWS_1-5 last. But, it was observed
in Table 65 that, among six pairs of the comparisons that
attained statistical significance, five of them were the
differences between the validities under the CVS_1 and CWS_1-5
techniques with LA, favouring the CVS_1. However, the
tendency for the validities under the CVS_1 technique to be
higher than those under the CWS_1-5 and ELS techniques was
apparent throughout the first part of Table 65. It was
evident that the validities of VA test scores under the
CVS_1 technique were consistently higher than those under
other scoring techniques when School Achievement Scores
were criteria, especially when LA was used.

2. Table 64 shows that the results were reversed in
the case of RA test scores. The validities under the ELS
technique ranked first, with those under the CWS_1-5 second,
and under the CVS_1 last, with only one exception in the
case of SC criterion. However, Table 66 shows that no
difference attained statistical significance. It was
evident that, though the validities under the ELS technique
were consistently higher than those under the CVS_1
technique, the differences were not large enough for a
conclusion in favour of the ELS technique.

Validities with Aptitude Test Scores. Table 67
shows the results of tests of the differences between the
validities of VA_1 test scores, and Table 68 of RA_1 test
scores.

1. The results in Table 64 indicate that, when Aptitude
Test Scores were criteria, the patterns of rank order of the
validities were not the same across criteria, but were similar
across tests. When VB was a criterion, the validities under
the CVS_1 technique ranked third in both VA and RA tests, but
those under the CWS_1-5 and ELS techniques interchanged their
ranks. When RB was a criterion, the validities under the
ELS technique ranked third, and those under the CVS_1 were
likely to rank first, though it was not clear in the
case of VA with RB.

2. Considering the results of tests of the differences
in Tables 67 and 68, only the differences between the
validities of RA test scores under the CVS_1 and CWS_1-5 with VB
criterion attained statistical significance. However, a
tendency toward similar results was also seen in the case
of the VA test with the VB criterion. This evidence
indicates that, when VB test was a criterion, the validities
of test scores under the CVS_1 technique were lower than those
under the CWS_1-5 and ELS. However, no selection between the
CWS_1-5 and ELS techniques could be made since no difference
between those values attained statistical significance.

3. The differences between the validities when RB was
a criterion were not statistically significant. Thus, no
method provided consistently higher validities than any other.
There was an apparent tendency for the validities under the
ELS technique to be the lowest.


Summary of Results 

In this section, the focus of the study was on the 
selection of the scoring techniques which would provide 
highest validity when related to school achievement and 
aptitude. The results indicated that the scoring techniques 
affected test validity in different ways. For some tests
and some criteria, the conventional scoring technique, i.e.,
the CVS_1 technique, provided higher validities than did the
others; for example, in the case of VA test with School
Achievement. But for some other cases, the ELS and CWS_1-5
techniques provided higher validities than did the
conventional one; for example, in the case of RA test with VB test.
Although there were some consistent patterns of these values,
the patterns did not hold over tests or types of criteria,
and, in most cases, results were not statistically
significant. Thus, there was no indication that the experimental
techniques improve the validities of tests when compared
with the conventional scoring procedure.
CHAPTER V


CONCLUSIONS 


Summary 


The purpose of this study was to compare several 
methods of scoring multiple-choice tests in terms of the 
reliability and validity of test scores. Subjects were 
1028 grade nine students randomly assigned to three comparably
sized groups according to test-taking method. The methods
were: the conventional, the confidence, and the elimination
methods. There were four scoring techniques used under these
testing procedures. They were: conventional scoring, 
differential weighted scoring, confidence weighted scoring, 
and elimination scoring. Conventional scoring was used with 
the conventional and elimination methods; differential 
weighted scoring with the conventional method; confidence 
weighted scoring, which had five scoring functions, with the 
confidence method; and elimination scoring with the elimina- 
tion method. Two aptitude tests were used with these 
experimental methods. They were a vocabulary and a mathe- 
matics aptitude test. Two types of criterion scores were 
obtained: school achievement scores, and aptitude test 
scores from similar forms of the vocabulary and mathematics 
aptitude tests. 


There were two testing sessions. During the first,
two aptitude tests were taken under the experimental
methods. During the second, the same two tests were repeated
and two other aptitude tests were administered to obtain
test scores for use as criteria. The aptitude tests used as
criteria were administered and scored with the conventional
procedure. The school achievement scores were obtained
from schools at the end of the first term.

The analyses of data were designed to obtain three 
statistics for test scores under each scoring technique. 
They were: the internal consistency of test scores in 
terms of alpha coefficient, the test-retest reliability, 
and the validity with school achievement, and aptitude test 
scores. Tests of the differences between values in each 
type of test statistic were made. A selection of the best 
method was made according to the results of these tests,
where possible.
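
As an illustration of the first two statistics named above, the
following Python sketch computes an alpha coefficient from item
scores and a test-retest correlation from two sets of total scores.
It is a minimal sketch, not the computing procedure used in the
study: the function names and the small data set are hypothetical.

```python
from statistics import mean, variance


def alpha_coefficient(item_scores):
    """Coefficient alpha. item_scores holds one row per examinee
    and one column per item."""
    k = len(item_scores[0])  # number of items
    item_variances = [variance([row[i] for row in item_scores])
                      for i in range(k)]
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def pearson_r(x, y):
    """Product-moment correlation, used here for test-retest reliability."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data: four examinees, three items scored one or zero
first_session = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
alpha = alpha_coefficient(first_session)

# Hypothetical total scores of the same examinees on two sessions
session_1_totals = [3, 2, 1, 0]
session_2_totals = [3, 1, 2, 0]
retest_r = pearson_r(session_1_totals, session_2_totals)
print(alpha, retest_r)
```

The same two functions apply to weighted item scores (for example,
confidence points) as well as to one-or-zero scores, since both
depend only on variances and covariances.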


Findings and Implications 


The ultimate goal of this study was to examine
differences in test reliability and validity using the new
test-taking and scoring methods, employing as a baseline the
results obtained by administering and scoring tests under
the conventional procedure, i.e., conventional scoring
with conventional testing. Test-taking and scoring methods
used in this study, except the conventional one, have been
proposed for their ability to assess students' partial
knowledge and eliminate guessing. The study was designed
to examine the following queries:

1. Whether test scores obtained under any of the 
experimental methods are more reliable than those obtained 
under the conventional method (evidence examined included 
both the internal consistency of test scores and the stabil- 
ity of test scores over a period of time), and 

2. Whether test scores obtained under any of the 
experimental methods have higher validity for outside 
criteria measuring either (a) the same or similar traits 
or (b) school achievement, than those obtained under the 
conventional method. 

The analysis carried out and presented in Chapter
four revealed that test scores obtained from two scoring
functions, i.e., functions four and three, under the
confidence testing were more reliable than those obtained with
the conventional method. The evidence is apparent in both
the internal consistency of test scores in terms of the 
alpha coefficient, and test-retest reliability. The 
analysis indicated, however, that no experimental method 
provided test scores more valid for outside criteria than 
did the conventional one. Thus, the hypothesis regarding 
higher reliability of confidence weighted scores was 
confirmed, but no evidence supported the hypothesis regarding 
higher validity. 

All scoring techniques used in the present study, 
except three scoring functions under the confidence testing,
i.e., functions three, four and five, were previously
studied. A summary of the results of some former studies 
is given in Table 69. There are only two studies that 
showed significant gains over conventional testing and 
scoring, one in terms of reliability and one, validity. The 
others revealed both increased and decreased values of the 
two test characteristics. The evidence from these studies 
indicates no agreement regarding the superiority of either 
experimental method over the conventional method. 

In this study, among the three techniques previously
used (thus excluding functions three, four and five under
the confidence testing), the differential weighted scoring
(DWS) is the only technique that resulted in more reliable
test scores than did the conventional technique (CVS_1),
and only in terms of the internal consistency (see Table 10,
Chapter IV). None of them provides test scores more valid
than does the conventional one. The confidence weighted
scoring techniques referred to in this case are the CWS_1
and CWS_2 in this study. Thus, the results of the present
study on these three scoring techniques previously used,
combined with those former results, can lead to a fairly
definite conclusion. The three scoring methods, i.e., the
DWS, ELS, and the CWS with the techniques as specified in
Table 69, cannot be expected to improve both the reliability
and validity of test scores. Further study on these scoring
techniques seems unwarranted.


Three new scoring functions used under the confidence
testing method are introduced in this study. These are:
(1) CWS_3, the logarithmic function complementary to that
used in previous studies (see Hambleton et al., 1970);
(2) CWS_4, the quadratic function; and (3) CWS_5, the
normalized scoring function (see Chapter III). The CWS_3 and CWS_4
functions are based on the increasing increment scoring
model, and the CWS_5 function, on the normalized increment
scoring model (see Chapter III). These two scoring models
are presented here for the first time. The results of the
analysis show that the CWS_3 and CWS_4 under the confidence
testing condition provide test scores more reliable than
does the conventional method, both in terms of the internal
consistency and test-retest reliability. None of these
three functions, nor any of those previously used,
results in more valid test scores than conventional scores.
In fact, for some criteria, the validities obtained from
these functions are significantly lower than those from the
conventional method (see Table 65, Chapter IV). Though the
reliabilities from the CWS_3 and CWS_4 functions under the
confidence testing method are higher than those from the
conventional method, this fact does not necessarily point
to the superiority of these techniques, since there is no
increased validity from either experimental technique. The
use of one or the other of these two scoring functions in
normal testing practice may, therefore, not be worth the
effort.


There are two important studies reported in the
literature in which the confidence testing approach is
compared with the conventional method. These are the
studies by Hambleton et al. (1970) and Hopkins et al. 
(1973). These studies are concerned with both reliability 
and validity. Hambleton et al. (1970) used a logarithmic 
function (which corresponds to the CWS. function used in 
the present study) with 100 points of confidence. They 
found that the use of confidence weighted scoring resulted 
in decreased reliability but increased validity. Since
the increment in test validity did not attain statistical
significance, the authors did not take the result as
definitive. They pointed out that three weak points
contributed to the nonsignificant increment: (1) the
small group size (211 for all three groups), (2) the longer
testing time required in the confidence testing, and (3)
the excessively low difficulty level of the test for the
subjects employed.

The results of the study by Hambleton et al. (1970)
led to another study by Hopkins et al. (1973). These
authors postulated that:

. . . if the increase in reliability is the result
of a gambling response style, it is conceivable that
validity could actually decrease even though
reliability is increased (Hopkins et al., 1973, p. 138).

Hopkins et al. (1973) instructed students to place
an H, M, or L beside their response, indicating high, medium,
or low confidence in the answer given. No scoring formula
was used. The item scores depended on the level of
confidence given and on the correctness of the responses. They
found that the use of such a technique resulted in
increased reliability but decreased validity. No difference
attained statistical significance. The conclusions from
these results are that:

. . . the added reliable variance often observed in
confidence-weighting studies may be irrelevant response
style variance and does not increase validity, in fact,
it may actually diminish validity (Hopkins et al.,
1973, p. 140).

The results of the present study in the case of the
CWS_3 and CWS_4 functions under the confidence testing
condition tend to support this conclusion. The use of these two
scoring functions resulted in increased reliability but
decreased validity. Although the scoring techniques used
in the two studies are not identical, it is apparent that
the results of the present study confirm those results
found by Hopkins et al. (1973).
The fact that an increase in test reliability can 

be accompanied by a decrease in validity is not new. It 
is well known that, when we are trying to make test scores 
more reliable in terms of internal consistency, we are 
increasing the homogeneity of test items or, in this case, 
test scores. It is also known that increased heterogeneity 
of a test leads to increased opportunity for raising the 
validity (Magnusson, 1967, pp. 179-194). Therefore, when 
we are trying to make test scores more reliable in terms of 
internal consistency, i.e., the alpha coefficient, and valid
at the same time, we may be undertaking an impossible task.
However, this task may be accomplished if we at the same time
make the criterion more homogeneous and ensure that it
measures exactly the same factor as the experimental test
does. In this study, the criteria were not modified in
any way. The school achievement scores were to a large
extent multi-factor measurements. Because we had multi-factor
criteria and more homogeneous test scores, i.e.,
test scores from the CWS_3 and CWS_4 functions, we could not
expect a higher correlation between them than when test
scores were less homogeneous, i.e., test scores from the
conventional method.

The aptitude test criteria, on the other hand, were
originally constructed to measure specific factors. Items
in these tests were more homogeneous than those in the
school achievement tests. But, because we did not try to
improve their homogeneity as we did with the experimental
tests, aptitude test scores as criteria, i.e., test scores
from VB and RB tests, were relatively less homogeneous than
the experimental test scores, i.e., test scores from VA and
RA tests under the CWS_3 and CWS_4 functions. In this case
it could be expected that the validity of the more reliable
test scores, i.e., test scores from the CWS_3 and CWS_4
functions, would not be much higher than that of the less
reliable test scores, i.e., test scores from the
conventional method, when these aptitude test scores, i.e., test
scores from VB and RB tests, were criteria.


Evidence to support the above discussion is seen in
Table 70. In this table, the validities of the vocabulary
test (VA) with Language Arts (LA) and another vocabulary
test (VB) are shown. It is apparent that, when school
achievement was the criterion, the validity of test scores
under the conventional method was significantly higher 
than those under the confidence method. When an aptitude 
test was criterion, the results reversed, although no 
differences attained statistical significance. The results 
shown in Table 71, for the mathematics aptitude test (RA) 
are not as clear. The validity with Mathematics (MA) under 
the conventional method tended to be lower than others, and 
when another mathematics aptitude test (RB) was the criterion, 
all values were comparable. Since the differences of the 
validities with LA did not attain statistical significance, 
the results neither contradict nor support the above con- 
clusions. The case of the mathematics aptitude test (RA) 
is likely to be a result of the test's nature, since it is 
evident that, when a comparison among groups was made, both 
in the case of test-retest reliability and the validity 
(Chapter III), the differences between values from the 
mathematics aptitude test (RA) were less marked 
than those from the vocabulary test (VA). This is likely 
an indication that the test-taking and scoring methods 
have less effect on the mathematics aptitude test than on 
the vocabulary test. 


The ultimate goal of the present study, as well as 
TABLE 70


COMPARISONS OF THE VALIDITIES OF VA TEST SCORES UNDER THE
CONVENTIONAL AND CONFIDENCE METHODS (VALUES ARE z's)


Criteria


* Significantly higher than the other two values with LA.


TABLE 71 


COMPARISONS OF THE VALIDITIES OF RA TEST SCORES UNDER THE 
CONVENTIONAL AND CONFIDENCE METHODS (VALUES ARE z's) 


Criteria 
of the others reviewed in Chapter II, was to find new
test-taking and scoring techniques that can eliminate two 
crucial disadvantages inherent in the use of the conven- 
tional method. These are: (1) the inability to assess 
partial knowledge and (2) the encouragement of guessing. 

It has been suggested by several test specialists that the
test-taking and scoring techniques used in this study can
eliminate these two disadvantages (Coombs et al., 1956;
de Finetti, 1965; Hopkins et al., 1973; Shuford et al.,
1966; Wang & Stanley, 1970). But, since the new techniques
do not appear to result in increased validity of test
scores--which is the most important characteristic of a
test--the use of these techniques in practical testing
may not be worth the effort. It is apparent that, in order
to overcome the disadvantages of the conventional technique,
one should look for new approaches rather than pursuing
these methods.


Suggestions for Further Research 

Because this study was confined to one specific 
level of subjects and only two aptitude tests, a study on 
the same problem with other levels of subjects and other 
tests is recommended. The results of such a study combined 
with the present one will, no doubt, make the findings more 
meaningful and the implications more comprehensive. 

In this study, no effort was made to find the
influence of nonintellectual factors on subjects' test
scores. Since there is a possibility that subjects'
performance is affected by the influences of these factors,
there should be a study on this problem. The results of
such a study may lead to improvements of the techniques
used in the present investigation.
INSTRUCTIONS TO EXAMINERS


Tell the students that these tests are part of a
study on improving methods for marking multiple-
choice exams. Emphasize that it is important that 
they answer all questions both quickly and 
accurately. 


Have the students clear their desks of everything 
except a pencil. (If no pencil is available, a 
pen is OK as a last resort). 


Distribute the answer sheets and the VA tests 
according to the scheme outlined below (or your 
version of it). 


Row 1     Row 2     Row 3     Row 4     Row 5

[Diagram: order in which the test forms are distributed along each row]


Have the students put their names on the answer 
sheets and under Faculty or School write the name 
of their school and their room number. 


Read the general instructions to the class, have 
them read the instructions on their test booklets, 
and then check the class for difficulties in under- 
standing the instructions. 


Begin and time the test. Total time = 12 min. 
Have one student pick up the VA tests while you
distribute the RA tests according to the same
sequence as used previously.


Have the students read the instructions on the RA 
booklets. 


Begin and time the test. Total time = 15 min. 


Collect the RA tests and answer sheets, thank the 
students and teacher, and leave. 
INSTRUCTIONS FOR STUDENTS


THESE INSTRUCTIONS ARE TO BE READ BY THE EXAMINER AFTER 
THE STUDENTS HAVE BEEN GIVEN THE VA TEST BOOKLETS AND 
ANSWER SHEETS 


1.  The multiple-choice tests which you are going to
    write today will be answered in three different ways.
    One method is the same as you usually use, but the
    other two are different. Each of you will use only
    one method.


2.  The instructions for answering the test are already
    printed on the cover of each test booklet. Read
    these instructions carefully and make sure that
    you understand them before you start writing. If,
    after reading the instructions, you have any
    questions, please ask for help. I will explain the
    method to you.


3.  I will give you about 5 minutes to read the test
    instructions and ask questions. Do not open the
    test booklet until I tell you to do so. Everyone
    will start the test at the same time.


4.  Now all of you have the test booklets and answer
    sheets--begin reading the instructions. Be sure
    you understand the instructions before you start
    writing the test.


WHEN ALL STUDENTS HAVE READ THE INSTRUCTIONS AND CLEARED
UP ANY DIFFICULTIES, CONTINUE AS FOLLOWS:


5.  Is everyone ready? Begin!


12 MINUTES LATER:


6.  Close your booklets please!


REPEAT STEPS 4, 5, and 6 FOR THE SECOND TEST. REMEMBER 
THE SECOND TEST (RA) IS 15 MINUTES LONG. 
VOCABULARY TEST -- VA-CV.


This is a test of your knowledge of word meanings. Look at the 
example below. One of the five numbered words has the same meaning 
or nearly the same meaning as the word above the numbered words. In 
this example the right answer has already been marked. This was done 
by placing a black mark between the guidelines on the answer sheet 
as shown, by using an HB pencil. 


1.  jovial

    1-refreshing
    2-scare
    3-thickset
    4-wise
    5-jolly


To mark an answer, decide first which is the best answer. Then, 
on the answer sheet, find the row of the answer numbered the same as 
the question. Make a black mark between the guidelines for the best 
answer. Make only one mark for each question. 


Your score will be the number of the questions correctly 
answered. 


You will have 12 minutes to answer all 25 questions in this 
test. 


DO NOT MARK IN THIS TEST BOOKLET. If you have any questions, 
please ask us now. 


DO NOT TURN THIS PAGE UNTIL ASKED TO DO SO. 
VOCABULARY TEST -- VA-CF.


This is a test of your knowledge of word meanings. There are
25 questions to be answered. Each question has one word given above
and 5 numbered words given below. One of them has the same meaning
or nearly the same meaning as the word above and is the correct
answer. Your task is not to determine the correct answer, but to 


ee ee ee 


If you are sure that one word, say the 1st. word, is correct, you 
may give 10 points to that word and give 0 to all others. But if 
you are not sure about any of them, you may give only a major 
portion of 10 points to one word which you have more confidence in 


than the others, and then distribute the rest of the 10 points to 
some other words. 


Look at the example below. In this example, the 5th. word,
jolly, is the correct word, so all other words are incorrect.


1.  JOVIAL
    1-refreshing
    2-scare
    3-thickset
    4-wise
    5-jolly


A.  If you are sure that the 5th. word, jolly, is correct, you
may give 10 points of confidence to that word by writing '10' under
the corresponding number on the answer sheet, and then write '0'
under all other numbers, as shown below -


1.  1   2   3   4   5
    0   0   0   0   10


B.  If you think that the 5th. word is correct, but you are not
very sure and you still think that the 2nd. word, scare, might be
correct, you may give 7 points to the 5th. and 3 points to the 2nd.
words by writing '7' and '3' under the corresponding numbers on the
answer sheet and giving other numbers '0', as shown below -


1.  1   2   3   4   5
    0   3   0   0   7


or, you may give 6 points to the 5th. word, and 2 each to the 2nd.
and the 3rd. words on the answer sheet as in this example -


1.  1   2   3   4   5
    0   2   2   0   6

or, give them as in this example -


i. Se et ee eee 
ee ee 


You are free to give any number of points to each answer in
an item, but you have to make sure that the points you give to all
words add up to 10; no more; no less!


Remember that you have to answer all questions on the answer
sheet by writing numbers of points under the corresponding number
of words, as shown in the examples.


Your score for an item will be the number of points you give
to the correct answer. If you give 10 to the correct answer you
receive a score of 10, if you give 7 you receive 7, and if you give
0 you receive 0. So your score for an item will vary from 0 to 10.
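

The item-scoring rule stated to students above can be sketched in
Python as follows. This is only an illustration of the rule as worded
in this booklet, with a hypothetical function name; it does not
reproduce the confidence weighted scoring functions described in
Chapter III.

```python
def confidence_item_score(points, correct_position):
    """Score one item under the rule stated in the booklet.

    points: the five confidence values written for words 1 to 5.
    correct_position: position (1 to 5) of the correct word.
    """
    if len(points) != 5:
        raise ValueError("each item needs exactly five values")
    if any(p < 0 for p in points) or sum(points) != 10:
        raise ValueError("points must be non-negative and add up to 10")
    # The score is the number of points placed on the correct word
    return points[correct_position - 1]


# The three worked examples from the booklet (word 5, jolly, is correct)
score_sure = confidence_item_score([0, 0, 0, 0, 10], 5)
score_two_words = confidence_item_score([0, 3, 0, 0, 7], 5)
score_three_words = confidence_item_score([0, 2, 2, 0, 6], 5)
print(score_sure, score_two_words, score_three_words)
```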


It is important for you to know that you will receive a higher 
score on the test if you indicate honestly your degree of confidence 
in the correctness of each choice. 


You will have 12 minutes to answer all 25 questions in this 
test. 


DO NOT MARK IN THIS TEST BOOKLET. If you have any questions, 
please ask us now. 


DO NOT TURN THIS PAGE UNTIL ASKED TO DO SO. 
VOCABULARY TEST -- VA-EL.


This is a test of your knowledge of word meanings. There are
25 questions to be answered. Each question has one word given above
and five numbered words given below. One of the five numbered words
has the same meaning or nearly the same meaning as the word above
and is the correct answer. Your task is not to determine the correct
answer, but to determine, among the five numbered words, which of
them are incorrect.


Actually, there are 4 numbered words which are incorrect. You
should choose only words which you are sure are incorrect. It is not
necessary that you find all 4 incorrect words. You may choose only
one or two or three words which you are sure are wrong.


Look at the example below. In this example, the 5th. word, jolly, 
is the correct answer, so all other words are incorrect. If you can 
find all of them as in this example, then, on the answer sheet on the 
row of the answer numbered the same as the question, you cross out 
all numbers except number 5. 


1. jovial          1    2    3    4    5


1-refreshing
2-scare
3-thickset
4-wise
5-jolly


Your score for an item will be the number of incorrect words
you cross out. You will score 4 points from the answer shown above.
If, by mistake, you also cross out the correct word, you will lose
4 points for that word.


Suppose that you cross out numbers 2, 3, 4 and 5 on your answer
sheet for the above question. You will score 3 points from words
numbered 2, 3 and 4, but lose 4 points for number 5, so your score
for this question would be -1. But if you cross out only numbers
2, 3 and 4, your score will be 3 instead of -1 or 4.
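The elimination rule described above (one point per incorrect word crossed out, minus 4 if the correct word is crossed out) can be sketched as follows. The function name `score_elimination_item`, the use of a set of option numbers, and the `penalty` parameter are assumptions made for this illustration; only the point values come from the instructions.

```python
def score_elimination_item(crossed_out, correct_option, n_options=5, penalty=4):
    """Score one elimination item.

    crossed_out: iterable of 1-based option numbers the student crossed out.
    correct_option: 1-based number of the correct option.
    penalty: points lost for crossing out the correct option
             (4 in the instructions).
    """
    crossed = set(crossed_out)
    if not crossed <= set(range(1, n_options + 1)):
        raise ValueError("option numbers must lie between 1 and n_options")
    gained = len(crossed - {correct_option})   # incorrect options eliminated
    lost = penalty if correct_option in crossed else 0
    return gained - lost


# The cases worked through in the instructions (word 5 is correct).
print(score_elimination_item({1, 2, 3, 4}, 5))  # 4
print(score_elimination_item({2, 3, 4, 5}, 5))  # -1
print(score_elimination_item({2, 3, 4}, 5))     # 3
```
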


So, be careful to cross out only words which you are really
sure are incorrect. DO NOT GUESS.


You will have 12 minutes to answer all 25 questions in 
this test. 


DO NOT MARK IN THIS TEST BOOKLET. If you have any questions, 
please ask us now.


DO NOT TURN THIS PAGE UNTIL ASKED TO DO SO. 
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MATHEMATICS APTITUDE TEST -- RA-CV. 


In this test you will be asked to solve some problems in mathe- 
matics. Solve each problem and mark your answer on the answer sheet 
as shown by the following example - 


1. How many pencils can you buy for 50 cents at the rate
of 2 for 5 cents?


1-10 
2-20 


1.      1    2    3    4    5


To mark your answer, decide first which is the right answer. 
Then, on the answer sheet, find the row of the answer numbered the 
same as the question. Make a black mark, with an HB pencil, between
the guidelines for the right answer. Make only one mark for each 
question. 


Your score will be the number of the questions correctly 
answered. 
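Conventional scoring as described here is a count of correctly answered questions. A minimal sketch, assuming responses and key are held as equal-length lists of option numbers and that an omitted question is recorded as `None` (the function name, the data layout, and the sample key are hypothetical, not taken from the test):

```python
def score_conventional(responses, key):
    """Return the number of questions whose marked option matches the key.

    responses: list of marked option numbers, or None for an omitted question.
    key: list of correct option numbers, same length as responses.
    """
    if len(responses) != len(key):
        raise ValueError("responses and key must have the same length")
    return sum(1 for marked, right in zip(responses, key) if marked == right)


# Hypothetical four-question key and answer sheet, for illustration only.
print(score_conventional([2, 1, None, 4], [2, 3, 5, 4]))  # 2
```
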


You will have 15 minutes to answer all 15 questions in 
this test. 


DO NOT MARK IN THIS TEST BOOKLET. If you have any questions, 
please ask us now. 


DO NOT TURN THIS PAGE UNTIL ASKED TO DO SO. 
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MATHEMATICS APTITUDE TEST -- RA-CF. 


In this test you will be asked to solve some problems in mathe-
matics. There are 15 problems to be solved. Each problem has 5
numbered answers. One of them is correct, the other 4 are incorrect. 
Your task is not to determine the correct answer, but to indicate 


[...]


are sure that one answer, say the 1st. answer, is correct, you may 
give 10 points to that answer and give 0 to all others. But if you 
are not sure about any of them, you may give only a major portion
of 10 points to one answer which you have more confidence in than
the others, and then distribute the rest of the 10 points to some
other answers.


Look at the example below. In this example, the 2nd. answer, 20, 
is the correct answer, so all others are incorrect. 


1. How many pencils can you buy for 50 cents at the rate
of 2 for 5 cents?


1-10 
2-20 
3-25
4-100 
5-[illegible]


A. If you are sure that the 2nd. answer is correct, you may
give 10 points of confidence to that answer by writing '10' under
the corresponding number on the answer sheet, and then write '0'
under all other numbers, as shown below -


1.      1    2    3    4    5
        0   10    0    0    0


B. If you think that the 2nd. answer is correct, but you are
not very sure and you still think that the 3rd. answer, 25, might
be correct, you may give 7 points to the 2nd. and 3 points to the
3rd. answers by writing '7' and '3' under the corresponding numbers
on the answer sheet and giving other numbers '0', as shown below -


1.      1    2    3    4    5
        0    7    3    0    0


or, you may give 6 points to the 2nd. answer, 2 to the 3rd.
and the 4th. answers on the answer sheet as in this example -
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You are free to give any number of points to each answer in 
an item, but you have to make sure that the points you give to all 
answers add up to 10; no more; no less!


Remember that you have to answer all questions on the answer
sheet by writing numbers of points under the corresponding number
of answers, as shown in the examples.


Your score for an item will be the number of points you give
to the correct answer. If you give 10 to the correct answer you
receive a score of 10, if you give 7 you receive 7, and if you give
0 you receive 0. So your score for an item will vary from 0 to 10.


It is important for you to know that you will receive a higher 
score on the test if you indicate honestly your degree of confidence 
in the correctness of each choice. 


You will have 15 minutes to answer all 15 questions in this
test. 


DO NOT MARK IN THIS TEST BOOKLET. If you have any questions, 
please ask us now. 


DO NOT TURN THIS PAGE UNTIL ASKED TO DO SO.
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MATHEMATICS APTITUDE TEST -- RA-EL. 


In this test you will be asked to solve some problems in mathe- 
matics. There are 15 problems to be solved. Each problem has five 
numbered answers. One of them is correct, the other four are incorrect. 
Your task is not to determine the correct answer, but to determine, 
among the five answers, which of them are incorrect. 


Actually there are 4 numbered answers which are incorrect. You 
should choose answers which you are sure are incorrect. It is not 
necessary that you find all 4 incorrect answers. You may choose only 
one or two or three answers which you are sure are wrong. 


Look at the example below. In this example, the 2nd. answer, 20, 
is the correct answer, so all others are incorrect. If you can find 
all of them as in this example, then, on the answer sheet on the row
of the answer numbered the same as the question, you cross out all 
numbers except number 2. 


1. How many pencils can you buy for 50 cents at the rate
of 2 for 5 cents?
1-10
2-20
3-25
4-100
5-[illegible]


1.      1    2    3    4    5


Your score for an item will be the number of incorrect answers 
you cross out. You will score 4 points from the answer shown above. 
If, by mistake, you also cross out the correct answer, you will lose 
4 points for that answer. 


Suppose that you cross out numbers 1, 2, 3 and 4. You will score
3 points for numbers 1, 3 and 4, but lose 4 points for crossing out
number 2, so your score for this question is -1. But if you cross
out only numbers 1, 3 and 4, your score will be 3 instead of -1 or 4.
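To see the full range of item scores this rule allows, the short sketch below tabulates the score for every combination of "number of incorrect answers crossed out" and "correct answer crossed out or not". The helper name `elimination_score` and its parameters are assumptions for illustration; the point values (+1 per incorrect answer, -4 for the correct one) are those given in the instructions.

```python
def elimination_score(n_incorrect_crossed, crossed_correct, penalty=4):
    """Item score: incorrect answers crossed out, minus the penalty
    if the correct answer was crossed out as well."""
    return n_incorrect_crossed - (penalty if crossed_correct else 0)


# Each row lists the scores for 0, 1, 2, 3 and 4 incorrect answers crossed out.
for crossed_correct in (False, True):
    row = [elimination_score(k, crossed_correct) for k in range(5)]
    print(crossed_correct, row)
# False [0, 1, 2, 3, 4]
# True [-4, -3, -2, -1, 0]
```

Under this rule an item score therefore lies between -4 and 4, and the case in the instructions (three incorrect answers plus the correct one crossed out) gives -1.
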


So, be careful to cross out only answers which you are really
sure are incorrect. DO NOT GUESS.


You will have 15 minutes to answer all 15 questions in this 
test. 


DO NOT MARK IN THIS TEST BOOKLET. If you have any questions, 
please ask us now. 


DO NOT TURN THIS PAGE UNTIL ASKED TO DO SO. 
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VOCABULARY TEST -- VB. 


This is a test of your knowledge of word meanings. Look at the
example below. One of the four numbered words has the same meaning
or nearly the same meaning as the word above the numbered words. In
this example, the right answer has already been marked. This was
done by placing a black mark between the guidelines on the answer
sheet as shown, by using an HB pencil.


1. attempt         1    2    3    4


1-run
2-hate
3-try
4-stop


To mark an answer, decide first which is the best answer. Then, 
on the answer sheet, find the row of the answer numbered the same 
as the question. Make a black mark between the guidelines for the best 
answer. Make only one mark for each question. 

Your score will be the number of the questions correctly 
answered.

You will have 8 minutes to answer all 30 questions in this 
test. 

DO NOT MARK IN THIS TEST BOOKLET. If you have any questions, 
please ask us now. 


DO NOT TURN THIS PAGE UNTIL ASKED TO DO SO. 
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MATHEMATICS APTITUDE TEST -- RB. 


In this test you will be asked to solve some problems in mathe- 
matics. Solve each problem and mark your answer on the answer sheet 
as shown by the following example - 


1. How many pencils can you buy for 50 cents at the rate
of 2 for 5 cents?


1-10 
2-20 


1.      1    2    3    4    5


To mark an answer, decide first which is the right answer. Then, 
on the answer sheet, find the row of the answer numbered the same as 
the question. Make a black mark, with an HB pencil, between the 
guidelines for the right answer. Make only one mark for each question. 


Your score will be the number of the questions correctly 
answered. 


You will have 10 minutes to answer all 15 questions in 
this test. 


DO NOT MARK IN THIS TEST BOOKLET. If you have any questions,
please ask us now. 


DO NOT TURN THIS PAGE UNTIL ASKED TO DO SO. 
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1. Vocabulary Test I.D. number 
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2. Mathematics Aptitude Test 

RA-CF. 

Example * Be sure that all points distributed add up to 10. 
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TABLE 72 


NUMBER OF UNIVERSITY STUDENTS RATING 
FOR OPTION WEIGHTS 




Class Number 
Ed. Psy. 487 8 
Ed. Psy. 489 Be 
Ed. Psy. 504 uh 
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OPTION WEIGHTS FOR VOCABULARY TEST: FORM A 


Option 
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TABLE 74 


OPTION WEIGHTS FOR MATHEMATICS 
APTITUDE TEST: FORM A
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APPENDIX B 


NORMS FOR VOCABULARY TEST: FORM B AND 


MATHEMATICS APTITUDE TEST: FORM B 
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TABLE 75 


NORMS FOR VOCABULARY TEST: FORM B 


Raw Score Percentile Score T-Score 
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TABLE 75 (continued) 


Number of students (grade 9) : 1028
Number of test items : 30 
Number of item responses : 4 
Mean of the total group : LOa2o 
Standard deviation : 4.05 


Prepared by Wanlop Kansup 
Department of Ed. Psy. 
The University of Alberta 
March, 1973.
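The norms table pairs each raw score with a T-score. The source does not state whether these T-scores are linear or normalized; the sketch below shows only the usual linear definition, T = 50 + 10(X - M)/SD, as an assumption. The function name `linear_t_score` is hypothetical, and because the group mean is not legible in this copy, the mean used in the example is a placeholder, not a value from the table. Only the standard deviation of 4.05 is taken from the table notes.

```python
def linear_t_score(raw, mean, sd):
    """Linear T-score: mean 50, standard deviation 10."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return 50 + 10 * (raw - mean) / sd


SD_FORM_B = 4.05          # standard deviation reported in the table notes
PLACEHOLDER_MEAN = 15.0   # hypothetical value; NOT the mean from the table

# A raw score one standard deviation above the placeholder mean.
print(round(linear_t_score(19.05, PLACEHOLDER_MEAN, SD_FORM_B)))  # 60
```
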
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NORMS FOR MATHEMATICS APTITUDE TEST. FORM B 
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Number of students (grade 9) ; 1028 
Number of test items : 15
Number of item responses : 5 
Mean of the total group : Dae 
Standard deviation : 2 Oa) 


Prepared by Wanlop Kansup 
Department of Ed. Psy. 
The University of Alberta 
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INTERCORRELATIONS, MEANS AND STANDARD DEVIATIONS 


TABLE 77


INTERCORRELATIONS AMONG SCORES IN CONVENTIONAL TESTING GROUP


[Table values illegible in source]


TABLE 78


INTERCORRELATIONS AMONG SCORES IN CONFIDENCE TESTING GROUP


[Table values illegible in source]

Note: All values are decimals.


TABLE 79


INTERCORRELATIONS AMONG SCORES IN ELIMINATION TESTING GROUP


[Table values illegible in source]


TABLE 80 


MEANS AND STANDARD DEVIATIONS OF TEST SCORES UNDER 
CONVENTIONAL TESTING AND SCORING METHODS 
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